Abstract


Rating scales provide an efficient and widely used means of recording
judgments. This paper reviews scaling issues within the context of a
psychometric model of the rating process and describes several methods
of scaling rating data. The scaling procedures include the simple mean,
standardized values, scale values based on Thurstone's Law of
Categorical Judgment, and regression-based values. The scaling methods
are compared in terms of the assumptions they require about the rating
process and the information they provide about the underlying
psychological dimension being assessed.

Scaling of Ratings: Concepts and Methods

Thomas  C.  Brown  and  Terry  C.  Daniel 


INTRODUCTION 

Rating scales offer an efficient and widely used means of recording
judgments about many kinds of stimuli. Such scales are often used in
studies relating to natural resources management, for example, to
measure citizen preferences for recreation activities (Driver and
Knopf 1977) or perceived scenic beauty of forest scenes (Brown and
Daniel 1986). In this paper we review issues regarding the use of
rating data, and describe and compare methods for scaling such data.

This paper provides theoretical and descriptive background for scaling
procedures available in a computer program called RMRATE, which is
described in a companion document (Brown et al. 1990). RMRATE is
designed to (1) scale rating data using a battery of scaling
procedures, (2) compare the scale values obtained by use of these
procedures, (3) evaluate to a limited extent whether the assumptions
of the scaling procedures are tenable, (4) determine the reliability
of the ratings, and (5) evaluate individual variations among raters.

Both  this  paper  and  the  RMRATE  computer  program 
are  outgrowths  of  an  effort  that  began  in  the  early  1970s 
to  better  understand  the  effects  of  management  on  the 
scenic  beauty  of  forest  environments.  An  important 
report  by  Daniel  and  Boster  (1976)  introduced  the  Scenic 
Beauty  Estimation  (SBE)  method.  The  SBE  method  is 
reviewed  and  further  developed  herein,  along  with  other 
scaling  procedures,  including  median  and  mean  ratings, 
standardized  scores,  and  a  new  scale  based  on  a  least 
squares  analysis  of  the  ratings. 

While scenic beauty has been the focus of the work that led up to this
paper, and continues to be a major research emphasis of the authors,
the utility of the scaling procedures is certainly not limited to
measurement of scenic beauty. Rather, this paper should be of interest
to anyone planning to obtain or needing to analyze ratings, no matter
what the stimuli.

Psychological scaling procedures are designed to deal with the likely
possibility that different people will use the rating scale
differently when recording their perceptions of the stimuli presented
for assessment. Scaling procedures can be very effective in adjusting
for some of these differences, but the procedures cannot correct for
basic flaws in experimental design that are also reflected in the
ratings. While aspects of experimental design are mentioned throughout
this paper, we will not cover experimental design in detail; the
reader desiring an explicit treatment of experimental design should
consult a basic text on the topic, such as Cochran and Cox (1957) or
Campbell and Stanley (1963).

We first offer a brief introduction to psychological scaling to
refresh the reader's memory and set the stage for what follows.
Readers with no prior knowledge of scaling methods should consult a
basic text on the subject, such as Nunnally (1978) or Torgerson
(1958). We then describe and compare several procedures for scaling
rating data. Finally, additional comparisons of the scaling procedures
are found in the appendix.


PSYCHOLOGICAL  SCALING 

Psychometricians and psychophysicists have developed scaling
procedures for assigning numbers to the psychological properties of
persons and objects. Psychometricians have traditionally concentrated
on developing measures of psychological characteristics or traits of
persons, such as the IQ measure of intelligence. Psychophysics is
concerned with obtaining systematic measures of psychological response
to physical properties of objects or environments. A classic example
of a psychophysical scale is the decibel scale of perceived loudness.

Among the areas of study to which psychophysical methods have been
applied, and one that is a primary area of application for RMRATE
(Brown et al. 1990), is the scaling of perceived environmental quality
and preferences. In this context, scaling methods are applied to
measure differences among environmental settings on psychological
dimensions such as esthetic quality, scenic beauty, perceived
naturalness, recreational quality, or preference.


Scaling  Levels 

An important consideration in psychological scaling, as in all
measurement, is the "level" of the scale that is achieved. Classically
there are three levels that are distinguished by the relationship
between the numbers derived by the scale and the underlying property
of the objects (or persons) that are being measured. The lowest level
of measurement we will discuss is the ordinal level, where objects are
simply ranked, as from low to high, with respect to the underlying
property of interest. At this level, a higher number on the scale
implies a higher degree (greater amount) of the property measured, but
the magnitude of the differences between objects is not determined.
Thus, a rank of 3 is below that of 4, and 4 is below 6, but the scale
does not provide information as to whether the object at rank 4
differs more from the object at 3 or from the object ranked at 6. At
this level of measurement only statements of "less than," "equal to,"
or "greater than," with respect to the underlying property, can be
supported.

Most psychological scaling methods seek to achieve an interval level
of measurement, where the magnitude of the difference between scale
values indicates, for example, the extent to which one object is
preferred over another. The intervals of this metric are comparable
over the range of the scale; e.g., the difference between scale values
of 1 and 5 is equivalent to the difference between 11 and 15 with
respect to the underlying property. Interval scale metrics have an
arbitrary zero point, or a "rational" origin (such as the Celsius
scale of temperature where 0 degrees is defined by the freezing point
of water). They do not, however, have a true zero point that indicates
the complete absence of the property being measured.

Interval scales will support mathematical statements about the
magnitude of differences between objects with respect to the property
being measured. For example, a statement such as "a difference of 4
units on the measurement scale represents twice as great a difference
in the underlying property as a difference of 2 units" could be made
about information in an interval scale. It would not be permissible,
however, to state that "the object with a value of 4 has twice as much
of the property being measured as the object scaled at 2." The latter
statement requires a higher level of measurement, one where all scale
values are referenced to an "absolute zero."

The highest level of measurement is the ratio scale, where the ratios
of differences are equal over the range of the scale; e.g., a scale
value of 1 is to 2 as 10 is to 20. Ratio scales require a "true zero"
or "absolute" origin, where 0 on the scale represents the complete
absence of the property being measured (such as the Kelvin scale of
temperature, where 0 represents the complete absence of heat).
Generally, ratio scales are only achieved in basic physical
measurement systems, such as length and weight. Absolute zeros are
much harder to define in psychological measurement systems, because of
the difficulty of determining what would constitute the absolute
absence of characteristics such as intelligence or preference.

It is important to note that the ordinal, interval, or ratio property
of a measurement scale is determined with reference to the underlying
dimension being measured; 20 degrees Celsius is certainly twice as
many degrees as 10, but it does not necessarily represent twice as
much heat.
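
The Celsius example can be checked numerically. The following is a
minimal Python sketch, not part of RMRATE; the function name
celsius_to_kelvin and the sample temperatures are assumptions chosen
for illustration. It suggests that ratios of differences survive a
change of origin, while ratios of the scale values themselves do not.

```python
def celsius_to_kelvin(deg_c):
    """Shift the origin from the freezing point of water to absolute zero."""
    return deg_c + 273.15


# Illustrative temperatures (assumed values, in degrees Celsius).
low_c, mid_c, high_c = 10.0, 20.0, 30.0
low_k, mid_k, high_k = (celsius_to_kelvin(t) for t in (low_c, mid_c, high_c))

# Ratio of scale values: depends on where the zero point is placed.
ratio_c = mid_c / low_c
ratio_k = mid_k / low_k

# Ratio of differences: unaffected by the shift of origin.
diff_ratio_c = (mid_c - low_c) / (high_c - mid_c)
diff_ratio_k = (mid_k - low_k) / (high_k - mid_k)

print("ratio of values:     ", ratio_c, round(ratio_k, 3))
print("ratio of differences:", diff_ratio_c, round(diff_ratio_k, 6))
```

Only the ratio of differences is the same on both scales, which is the
sense in which the Celsius scale is an interval scale but not a ratio
scale with respect to heat.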

The level of measurement may place restrictions on the validity of
inferences that can be drawn about the underlying property being
measured based on operations performed on the scale values (the
numbers). Some frequently used mathematical operations, such as the
computation and comparison of averages, require assumptions that are
not met by some measurement scales. In particular, if the average of
scale values is to represent an average of the underlying property,
then the measurement scale must be at least at the interval level,
where equal distances on the measurement scale indicate equal
differences in the underlying property. Similarly, if ratios of scale
values are computed, only a ratio scale will reflect equivalent ratios
in the underlying property.
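
To illustrate why comparing averages presupposes at least interval
level measurement, the sketch below uses made-up ratings and an
arbitrary order-preserving transformation (a square root, chosen only
for illustration). If the numbers carried only ordinal information,
the transformed values would be an equally legitimate coding, yet the
comparison of means reverses.

```python
import math
from statistics import mean

# Hypothetical ratings of two stimuli, two observers each.
ratings_x = [1, 9]
ratings_y = [4, 5]


def rescale(r):
    """An arbitrary monotonic (order-preserving) re-expression of a rating."""
    return math.sqrt(r)


mean_x = mean(ratings_x)
mean_y = mean(ratings_y)
mean_x_rescaled = mean([rescale(r) for r in ratings_x])
mean_y_rescaled = mean([rescale(r) for r in ratings_y])

print("original means:", mean_x, mean_y)
print("rescaled means:", round(mean_x_rescaled, 3), round(mean_y_rescaled, 3))
```

With these assumed numbers, stimulus X has the higher mean on the
original coding and the lower mean after rescaling, so a conclusion
drawn from the means is only meaningful if the intervals themselves
are meaningful.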

Scaling  Methods 

A number of different methods can be used for psychological scaling.
All methods involve the presentation of objects to observers who must
give some overt indication of the relative position of the objects on
some designated psychological dimension (e.g., perceived weight,
brightness, or preference). Traditional methods for obtaining
reactions to the objects in a scaling experiment include
paired-comparisons, rank orderings, and numerical ratings.

Perhaps the simplest psychophysical measurement method conceptually is
the method of paired-comparisons. Objects are presented to observers
two at a time, and the observer is required to indicate which has the
higher value on the underlying scale; e.g., in the case of
preferences, the observer indicates which of the two is preferred. A
related procedure is the rank-order procedure. Here the observer
places a relatively small set of objects (rarely more than 10) in
order from lowest (least preferred) to highest (most preferred). At
their most basic level, these two procedures produce ordinal data,
based on the proportion of times each stimulus is preferred in the
paired-comparison case, and on the assigned ranks in the rank-ordering
procedure.
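
As a small illustration of the paired-comparison case, the following
Python sketch tallies how often each stimulus is preferred and orders
the stimuli accordingly. The stimulus labels and the choices are
hypothetical, and a single observer is assumed for brevity.

```python
from itertools import combinations

stimuli = ["A", "B", "C", "D"]

# Hypothetical choices: for each pair, the stimulus the observer preferred.
choices = {
    ("A", "B"): "B",
    ("A", "C"): "C",
    ("A", "D"): "D",
    ("B", "C"): "C",
    ("B", "D"): "B",
    ("C", "D"): "C",
}

# Count how many times each stimulus was preferred.
wins = {s: 0 for s in stimuli}
for pair in combinations(stimuli, 2):
    wins[choices[pair]] += 1

# Each stimulus appears in len(stimuli) - 1 pairs.
proportion_preferred = {s: wins[s] / (len(stimuli) - 1) for s in stimuli}

# Ordinal result: stimuli ordered from least to most preferred.
ranking = sorted(stimuli, key=lambda s: proportion_preferred[s])

print(proportion_preferred)
print(ranking)
```

The proportions order the stimuli, but by themselves they say nothing
about how far apart the stimuli are on the underlying dimension.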

One of the most popular methods for obtaining reactions from observers
in a psychological measurement context uses rating scales. The
procedure requires observers to assign ratings to objects to indicate
their attitude about some statement or object, or their perception of
some property of the object.

In each of these methods, the overt responses of the observers
(choices, ranks, or ratings) are not taken as direct measures of the
psychological scale values, but are used as indicators from which
estimates of the psychological scale are derived using mathematical
procedures appropriate to the method. In theory, the psychological
scale values derived for a set of objects should not differ between
different scaling methods. For example, if a paired-comparison
procedure and a rating scale are used for indicating relative
preferences for a common set of objects, the psychological preference
scale values for the objects should be the same, or within a linear
transformation.

While the basic data from the paired-comparison and rank-order
procedures are originally at the ordinal level of measurement,
psychometric scaling procedures have been developed that, given
certain theoretical assumptions, provide interval level measures.
Perhaps the best known procedures are those developed by Thurstone
(see Nunnally (1978) and Torgerson (1958)), whereby choices or ranks
provided by a number of observers (or by one observer on repeated
occasions) are aggregated to obtain percentiles, which are then
referenced to a normal distribution to produce interval scale values
for the objects being judged. A related set of methods, also based on
normal distribution assumptions, was developed for rating scale data.
Later sections of this paper describe and compare procedures used with
rating data. Additional, more detailed presentations of the
theoretical rationale and the computational procedures are found in
the texts by authors such as Torgerson (1958) and Nunnally (1978).
Discussion of these issues in the context of landscape preference
assessment can be found in papers by Daniel and Boster (1976), Buhyoff
et al. (1981), and Hull et al. (1984).
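
The step of referencing aggregated proportions to a normal
distribution can be sketched as follows. This is a deliberately
simplified illustration, not the full Thurstone procedure described in
the texts cited above: the scene names and proportions are
hypothetical, and each proportion is simply converted to a standard
normal deviate.

```python
from statistics import NormalDist

# Hypothetical proportions of judgments in which each scene was preferred,
# aggregated over observers. Values must lie strictly between 0 and 1.
proportions = {"scene_1": 0.84, "scene_2": 0.50, "scene_3": 0.16}

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Reference each proportion to the normal distribution (inverse CDF).
scale_values = {
    name: standard_normal.inv_cdf(p) for name, p in proportions.items()
}

for name, z in scale_values.items():
    print(f"{name}: {z:+.2f}")
```

Under the normal distribution assumption, the resulting deviates are
treated as interval scale values; a proportion of 0.50 maps to the
scale midpoint of zero.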
Rating Scales

Rating response scales are typically used in one of two ways. With the
first approach, each value of the rating scale can carry a specific
descriptor. This procedure is often used in attitude assessment. For
example, the values of a 5-point scale could be specified as (1)
completely agree, (2) tend to agree, (3) indifferent, (4) tend to
disagree, and (5) completely disagree, where the observer is to
indicate degree of agreement about a set of statements. The observer
chooses the number of the response that most closely represents
his/her attitude about each statement. With the second use of rating
scales, only the end-points of the scale are specified. This format is
commonly used with environmental stimuli, where observers are required
to assign ratings to stimuli to indicate their perception of some
property of the stimuli. For example, a 10-point rating scale might be
used, with a "1" on the scale indicating "very low preference" for the
stimulus, and a "10" indicating "very high preference." Ratings
between 1 and 10 are to indicate levels of preference between the two
extremes. The end-points are specified to indicate the direction of
the scale (e.g., low ratings for less preference, high ratings for
more preference).

Whether associated with a specific descriptor or not, an individual
rating, by itself, cannot be taken as an indicator of any particular
(absolute) value on the underlying scale. For example, labeling one of
the categories "strongly agree" in no way assures that "strong"
agreement in one assessment context is equivalent to "strong"
agreement in another. Similarly, a rating of "5" by itself provides no
information. A given rating provides useful information only when it
is compared with another rating; that is, there is meaning only in the
relationships among ratings as indicators of the property being
assessed. Thus, it is informative to know that one stimulus is rated a
5 when a second stimulus is rated a 6. Here the ratings indicate which
stimulus is perceived to have more of the property being assessed.
Furthermore, if a third stimulus is rated an 8, we may have
information not only about the ranking of the stimuli, but also about
the degree to which the stimuli are perceived to differ in the
property being assessed.

Ratings, at a minimum, provide ordinal-level information about the
stimuli on the underlying dimension being assessed. However, ratings
are subject to several potential "problems" which, to the extent they
exist, tend to limit the degree to which rating data provide interval
scale information and the degree to which ratings of different
observers are comparable. Before we review some of these problems, it
will be useful to present a model of the process by which ratings are
formed and scaled.

Psychometric  Model 

The objective of a rating exercise is to obtain a numerical indication
of observers' perceptions of the relative position of one stimulus
versus another on a specified psychological dimension (e.g., scenic
beauty). This objective is approached by two sequential procedures
(fig. 1).

[Figure 1 diagram: Stimuli -> observer's internal process (perception,
criterion scale) -> Ratings -> Mathematical transformation -> Scale
values]

Figure 1. Conceptual model of the rating and scaling procedures.

The rating procedure requires that observers record their ratings of
the stimuli on the rating response scale provided. Observers are
presented with stimuli and, via an internal perceptual and cognitive
process, produce overt ratings. Because the experimental design of the
rating exercise delineates the specific characteristics of this
procedure, it must be carefully conceived to meet the overall
objective of an accurate assessment (vis-a-vis the circumstances to
which the results are to be generalized) of the perceived values of
the stimuli. The end product of the rating procedure is a matrix of
ratings by observers of stimuli. The rating for a given stimulus
depends upon both the perceived value of the stimulus (e.g., perceived
scenic beauty) and the judgment criterion scale being applied (e.g.,
how beautiful a scene must be perceived to be to merit a given
rating). Thus, the rating recorded by an observer cannot be
interpreted as a direct indicator of the perceived value for that
stimulus. The purpose of the scaling procedure is to apply appropriate
mathematical transformations to the ratings so as to produce scale
values for the stimuli. These scale values are intended to indicate
the perceived values of the stimuli, or, more correctly, the relative
positions of the stimuli on the psychological dimension being
assessed.

Within the rating procedure, a distinction is made between observers'
perceptions of a stimulus and their criteria for assigning ratings to
the stimulus. This two-part model (Daniel and Boster 1976) follows the
psychophysical models developed by Thurstone (Torgerson 1958), as
extended by signal detection theory (Green and Swets 1966). In
simplified terms, the model postulates that implicit perceptual
processes encode the features of the stimulus and translate them into
a subjective impression of that stimulus for the dimension being
judged (e.g., if the stimulus is an outdoor scene, the dimension could
be scenic beauty). This perceptual process is influenced by the
features of the stimulus in interaction with the sensory and
perceptual system of the observer, and may involve both "cognitive"
and "affective" processes (Kaplan 1987, Ulrich 1983, Zajonc 1980). The
result of this process is a relative impression of the stimulus, that
is, of its place relative to other possible stimuli. To produce an
overt rating, the perception of the stimulus must be referenced to a
judgment criterion scale. The organization of that scale allows the
perceived value of the stimulus to be expressed, as on a 10-point
rating scale.¹

¹Forced-choice (e.g., paired-comparison) and rank-order procedures
avoid the criterion component; in these procedures, the observer's
response is only dependent on the relative perceived value of each
stimulus.

Figure 2 depicts how hypothetical perceived values for each of three
stimuli could produce overt ratings according to four different
observers' judgment criterion scales. For this example the perceived
values for the three stimuli are assumed to be identical for all four
observers, and are indicated by the three horizontal lines that pass
from the "perceived value" axis through the four different judgment
criterion scales. When referred to the judgment criterion scale of
observer A, the perceived value of stimulus 1 is sufficient to meet
the criterion for the eighth category, but not high enough to reach
the ninth category, so the observer would assign a rating of 8 to the
stimulus. Similarly, the same stimulus would be assigned a rating of
10 according to observer C's judgment criterion scale, and only a 6
according to observer D's judgment criterion scale.

The illustration in figure 2 begins with the assumption that the four
observers have identical perceptions of the stimuli, but different
judgment criterion scales. In actual applications, of course, neither
the perceived values nor the criterion scales are known; only the
overt ratings are available for analysis. However, guided by a
psychometric model, scaling procedures derive estimates of differences
in perceived values that are potentially independent of differences in
judgment criteria. Relationships between ratings of different stimuli
by the same observer(s) are used to infer perceptions. Given the
conditions illustrated in figure 2, where only observer rating
criteria differ, the ideal scaling procedure would translate each
observer's ratings so that the scale values for a given stimulus would
be identical for all four observers.
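
The two-part model behind figure 2 can be sketched in Python as
follows. All numbers here are assumptions: the perceived values and
the category boundaries are invented so that the resulting ratings
match those quoted in the text (8, 10, and 6 for stimulus 1 under
observers A, C, and D; ratings of 4, 7, and 10 for observer C and 2,
4, and 6 for observer D). The function name rating is likewise
illustrative.

```python
from bisect import bisect_right


def rating(perceived_value, boundaries):
    """Return the rating category (1-10) for a perceived value.

    `boundaries` holds the nine lower criterion values for categories
    2 through 10; a value below the first boundary is rated 1.
    """
    return 1 + bisect_right(boundaries, perceived_value)


# Hypothetical perceived values, assumed identical for all observers.
perceived = {"stimulus 1": 165, "stimulus 2": 105, "stimulus 3": 45}

# Hypothetical judgment criterion scales (nine boundaries each).
criterion_scales = {
    "A": [-60, 20, 70, 95, 110, 125, 150, 200, 280],  # unequal intervals
    "B": [36 + 20 * k for k in range(9)],  # equal intervals, width 20
    "C": [-4 + 20 * k for k in range(9)],  # same width as B, lower origin
    "D": [30 + 30 * k for k in range(9)],  # equal intervals, width 30
}

ratings = {
    observer: [rating(v, scale) for v in perceived.values()]
    for observer, scale in criterion_scales.items()
}

for observer, values in ratings.items():
    print(observer, values)
```

Identical perceptions produce four different sets of ratings, which is
the situation a scaling procedure is meant to undo.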

Problems  With  Interpreting  Rating  Scales 

Unequal-interval judgment criterion scales. The rating scale provides
an opportunity for observers to directly indicate magnitudes of
differences in their perceptions of the objects, which is not provided
by either paired-comparison or rank-order techniques. However, for
this to occur, the intervals between rating categories must be equal
with regard to the underlying property being measured. Equally spaced
intervals would require that, for example, the difference in the
dimension being rated yielding an increase in rating from 2 to 3 is
equal to the difference in that dimension yielding an increase in
rating from 6 to 7. The criterion scales of observers B, C, and D of
figure 2 are equal-interval scales, while the scale of observer A is
an unequal-interval scale.

Figure 2. Judgment criterion scales of four observers with identical
perceived values.

Unfortunately, the intervals between rating categories on the
underlying psychological dimension will not necessarily be equal. An
obvious potential cause of unequal intervals in people's use of the
rating scale is the "end-point" problem. This problem could arise when
an observer encounters a stimulus to be rated that does not fit within
the rating criteria that the observer has established in the course of
rating previous stimuli. For example, the observer may encounter a
stimulus that he/she perceives to have considerably less of the
property being rated than a previous stimulus that was assigned the
lowest possible rating. This new stimulus will also be assigned the
lowest possible rating, which may result in a greater range of the
property being assigned to the lowest rating category than to other
rating categories. This may occur at both ends of the rating scale,
resulting in a sigmoid type relationship between ratings and the
underlying property (Edwards 1957).

The end-point problem can be ameliorated by showing observers a set of
"preview" stimuli that depicts the range of stimuli subsequently to be
rated. This allows observers to set ("anchor") their rating criteria
to encompass the full range of the property to be encountered during
the rating session. Hull et al. (1984) used this procedure when they
compared rating scale values to paired-comparison scale values for the
same stimuli. Paired-comparisons, of course, are not subject to an
end-point constriction. The linear relationship they found between the
two sets of scale values extended to the ends of the scale, suggesting
that the ratings they obtained did not suffer from the end-point
problem.

Of course, the end-point problem is not the only potential source of
unequal-interval ratings. Observers are free to adopt any standards
they wish for assigning their ratings, and there is no a priori reason
to expect that they will use equal intervals. For example, the
intervals might gradually get larger the farther they are from the
center of the scale, as in the criterion scale of observer A in
figure 2.

Because it is not possible to directly test for equality of intervals
among an observer's ratings, some statisticians argue that ratings
should not be used as if they represent interval data (e.g., Golbeck
1986). Others, however, argue, based on Monte Carlo simulations and
other approaches, that there is little risk in applying parametric
statistics to rating data, especially if ratings from a sufficient
number of observers are being combined (Baker et al. 1966, Gregoire
and Driver 1987, O'Brien 1979). Nevertheless, the possibility of an
unequal-interval scale leaves the level of measurement achieved by
rating scales somewhat ambiguous. The critical issue, of course, is
how the ratings, and any statistics or indices computed from those
ratings, relate to the underlying (psychological) dimension that is
being assessed. This issue can only be addressed in the context of
some theory or psychometric model of the perceptual/judgmental
process.

Lack of interobserver correspondence. Individual observers' ratings
frequently do not agree with those of other observers for the same
stimuli. Lack of correspondence could result from differences in
perception, which of course is not a "problem" at all; rather it is
simply a finding of the rating exercise at hand. Lack of
correspondence could also result from poor understanding of the rating
task, poor eyesight or other sensory malfunction, simple observer
distraction, or even intentional misrepresentation. Principal
component analysis or cluster analysis techniques may be useful to
determine whether observers fall into distinct groups with regard to
their perception of the stimuli, or whether observers who disagree are
unique. In some cases it may be appropriate to either drop some
observers from the sample (as being unrepresentative of the population
of interest) or weight their ratings less than others.

Most often, lack of correspondence between observers will be due to
differences in the judgment (rating) criteria adopted. Even if
individual observers each employ equal-interval rating criteria,
criterion scales can vary between observers, or the same observer may
change criteria from one rating session to another. As a consequence,
ratings can differ even though the perception of the stimuli is the
same (as shown in fig. 2). When differences between observers' ratings
are due only to differences in the criterion scale (i.e., their
perceived values are the same), their resulting ratings will be
monotonically related, but not necessarily perfectly correlated. But
if these observers employ equal-interval criterion scales, the
resulting ratings will also be perfectly correlated (except for random
variation).
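
The distinction between a monotonic relationship and a perfect
correlation can be illustrated with a short sketch. The three sets of
ratings below are hypothetical: two are assumed to come from
equal-interval criterion scales and one from an unequal-interval
scale, all applied to the same three perceived values. The helper
pearson_r is written out by hand so the arithmetic is visible.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of the same three stimuli.
equal_interval_1 = [10, 7, 4]
equal_interval_2 = [6, 4, 2]
unequal_interval = [8, 5, 3]

r_equal = pearson_r(equal_interval_1, equal_interval_2)
r_unequal = pearson_r(equal_interval_1, unequal_interval)

# All three sets order the stimuli identically (monotonic relationship).
order = lambda ratings: sorted(range(len(ratings)), key=ratings.__getitem__)
same_order = order(equal_interval_1) == order(unequal_interval)

print(r_equal, round(r_unequal, 4), same_order)
```

With these assumed ratings the two equal-interval sets correlate
perfectly, while the unequal-interval set preserves the ordering but
falls slightly short of a perfect correlation.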

Linear differences in ratings consist of "origin" and "interval size"
components. Assuming equal-interval scales, these two components can
be estimated for two sets of ratings of the same stimuli by simply
regressing one set on the other. The intercept and slope coefficients
of the regression would indicate the origin and interval size
differences, respectively. As an example of an origin difference,
consider criterion scales of observers B and C in figure 2. Remember
that all observers in figure 2 are assumed to agree in their
perception of the stimuli. Observer B's and C's criterion scales have
identical interval sizes, but B's scale is shifted up two rating
values compared with C's scale (suggesting that observer B adopted
more stringent criteria, setting higher standards than observer C).
The ratings of these two observers for scenes 1, 2, and 3 can be made
identical by a simple origin shift: either adding "2" to each of B's
ratings or subtracting "2" from each of C's ratings.

Observers'  criterion  scales  can  probably  be  expected 
to  differ  somewhat  by  both  their  origin  and  interval  size. 
As  an  example,  consider  the  criterion  scales  of  observers 
C  and  D  in  figure  2.  The  judgments  for  the  three  stimuli 


(ratings  of  4,  7,  and  10  for  observer  C  and  2,  4,  and  6 
for  observer  D)  indicate  that  these  scales  differ  by  an  or- 
igin shift  of  1.0  and  an  interval  size  of  1.5.  That  is,  the 
relationship  between  the  ratings  of  observers  C  and  D 
is represented by R_C = 1 + 1.5 R_D, where R_C and R_D
indicate  the  ratings  of  observers  C  and  D,  respectively. 
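
The regression described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name fit_line is hypothetical, the code is not part of RMRATE, and the ratings are those given in the text for observers C and D.

```python
# Sketch: estimate origin and interval-size differences between two
# observers by regressing one set of ratings on the other.

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

ratings_c = [4, 7, 10]  # observer C, scenes 1 to 3 (from the text)
ratings_d = [2, 4, 6]   # observer D, scenes 1 to 3 (from the text)

# Regress C's ratings on D's ratings: R_C = intercept + slope * R_D
intercept, slope = fit_line(ratings_d, ratings_c)
print(f"origin shift = {intercept:.1f}, interval size = {slope:.1f}")
```

With these three scenes the fit is exact, so the intercept and slope reproduce the origin shift of 1.0 and the interval size of 1.5 stated in the text.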

There  is  no  direct  way  to  observe  either  the  perceived 
values  of  the  stimuli  or  the  judgment  criteria  used  by 
the  observer;  both  are  implicit  psychological  processes. 
Thus,  if  two  sets  of  ratings  are  linearly  related,  it  is  im- 
possible to  tell  for  sure  whether  the  ratings  were 
produced  (1)  by  two  observers  who  have  identical 
rating  criterion  scales,  but  perceive  the  stimuli  differ- 
ently; (2)  by  two  observers  who  perceive  the  stimuli  the 
same,  but  use  different  criterion  scales;  or  (3)  by  two  ob- 
servers who  differ  both  in  perception  and  rating  criteria. 
In  our  application  of  the  basic  psychometric  model, 
however,  we  have  taken  the  position  that  perception  is 
a  relatively  consistent  process  that  is  strongly  related  to 
the  features  of  the  stimulus,  while  judgment  (rating) 
criteria  are  more  susceptible  to  the  effects  of  personal, 
social,  and  situational  factors.  This  is  a  theoretical  posi- 
tion that  is  consistent  with  the  Thurstone  and  signal 
detection  theory  models  (Brown  and  Daniel  1987,  Daniel 
and  Boster  1976,  Hays  1969).  Given  this  position,  linear 
differences  (i.e.,  differences  in  origin  and  interval  size) 
between  sets  of  ratings  are  generally  taken  to  be  indica- 
tions of  differences  in  judgment  criteria,  not  differences 
in  perception.  When  differences  in  ratings  are  due  to  the 
criterion  scales  used  by  different  observers  (or  observer 
groups),  psychometric  scaling  procedures  can  adjust  for 
these  effects  and  provide  "truer"  estimates  of  the  per- 
ceived values  of  the  stimuli. 

Linear  differences  between  group  average  criterion 
scales. — A  related  problem  may  arise  where  ratings  of 
two  different  observer  groups  are  to  be  compared.  The 
two  groups  may  on  average  use  different  rating  criteria, 
perhaps  because  of  situational  factors  such  as  when  the 
rating  sessions  of  the  different  groups  occurred.  For  ex- 
ample, time  of  day  may  influence  ratings,  regardless  of 
the  specific  attributes  of  the  stimuli  being  judged.  Scal- 
ing procedures  can  be  used  to  adjust  for  criterion  differ- 
ences (origin  and  interval)  between  observer  groups. 

Lack  of  intraobserver  consistency. — An  individual  ob- 
server's ratings  can  be  inconsistent,  with  different  rat- 
ings being  assigned  to  the  same  stimulus  at  different 
times. This problem is not restricted to ratings, but can
occur  whenever  an  observer's  perception  and/or  rating 
criterion  boundaries  waver  during  the  rating  exercise, 
so  that,  for  example,  a  given  stimulus  falls  in  the  "6" 
category  on  one  occasion  and  in  the  "5"  category  the 
next. 

Psychometric  models  generally  assume  that  both  the 
perceived  values  and  the  judgment  criteria  will  vary 
somewhat  from  moment  to  moment  for  any  given  stimu- 
lus/observer. This  variation  is  assumed  to  occur  because 
of  random  (error)  factors,  and  thus  is  expected  to  yield 
a  normal  distribution  of  perceived  criterion  values  cen- 
tered around  the  "true"  values  (Torgerson  1958).  Given 
these  assumptions,  the  mean  of  the  resulting  ratings  for 
a  stimulus  indicates  the  "true  value"  for  that  stimulus, 
and the variance of the observer's ratings for that stimulus
indicates the variation in underlying perceived
values  combined  with  variation  in  rating  criterion 
boundaries.  The  effects  of  inconsistencies  in  an  observ- 
er's ratings  can  be  ameliorated  by  obtaining  a  sufficient 
number  of  judgments  of  each  stimulus  (by  requiring  each 
observer  to  judge  each  stimulus  several  times)  to  achieve 
a  stable  estimate  of  the  perceived  values.  Repeated 
presentation  of  the  same  stimuli  may,  however,  lead  to 
other  problems. 

Perceptual  and  criterion  shifts. — In  some  circum- 
stances there  may  be  a  systematic  shift  in  rating  criteria 
and/or  perception  over  the  course  of  a  rating  session. 
Such  a  shift  could  be  related  to  the  order  in  which  the 
stimuli  are  presented,  or  to  other  aspects  of  the  rating 
session.^3 This is a potential problem with all types of
observer  judgments  where  several  stimuli  are  judged  by 
each  observer.  If  the  problem  is  related  to  order  of 
presentation,  it  can  be  controlled  for  by  presenting  the 
stimuli  to  different  observers  (or  on  different  occasions) 
in  different  random  orders.  If  the  shift  is  primarily  due 
to  changing  criteria,  it  may  be  possible  to  adjust  for  this 
effect  to  reveal  more  consistent  perceived  values. 

Baseline  Adjustments 

It  is  often  necessary  to  combine  the  ratings  for  a  set 
of  stimuli  obtained  in  one  rating  session  with  ratings  for 
another  set  of  stimuli  obtained  in  a  different  rating  ses- 
sion (for  examples,  see  Brown  and  Daniel  (1984)  and 
Daniel  and  Boster  (1976)).  This  need  may  occur,  for  ex- 
ample, when  ratings  are  needed  for  a  large  group  of 
stimuli  that  cannot  all  be  rated  in  the  same  session.  In 
such  cases,  the  investigator's  option  is  to  divide  the  set 
of  stimuli  into  smaller  sets  to  be  rated  by  different  ob- 
server groups,  or  by  the  same  group  in  separate  sessions. 
In  either  case,  it  is  important  that  some  stimuli  are 
common  to  the  separate  groups/sessions;  this  provides 
a  basis  for  determining  the  comparability  of  the  ratings 
obtained  from  the  different  groups/sessions,  and  pos- 
sibly a  vehicle  to  "bridge  the  gap"  between  different 
groups/sessions.  The  subset  of  stimuli  common  to  all 
rating  sessions  is  called  the  baseline. 

If  baseline  stimuli  are  to  be  used  to  determine  com- 
parability between  two  or  more  rating  sessions,  it  is 
important  that  the  baseline  stimuli  be  rated  under  the 
same  circumstances  in  each  case.  Otherwise,  the  ratings 
may  be  influenced  by  unwanted  experimental  artifacts, 
such  as  interactions  between  the  baseline  stimuli  and 
the  other  stimuli  that  are  unique  to  each  session.  To  en- 
hance the  utility  of  baseline  stimuli,  the  following 
precautions  should  be  followed:  (1)  the  observers  for 
each  session  should  be  randomly  selected  from  the  same 

^3 An example of such shifts is found in the "context" study reported
by  Brown  and  Daniel  (1987).  Two  observer  groups  each  rated  the  scen- 
ic beauty  of  a  set  of  common  landscape  scenes  after  they  had  rated  a 
set  of  unique  (to  the  groups)  scenes.  Because  of  the  differences  between 
the  two  sets  of  unique  scenes,  the  ratings  of  the  initial  common  scenes 
were  quite  different  between  the  groups.  However,  as  more  common 
scenes  were  rated,  the  groups'  ratings  gradually  shifted  toward 
consensus. 


observer  population,  (2)  the  observer  groups  should  be 
sufficiently  large,  (3)  the  baseline  stimuli  should  be 
representative  of  the  full  set  of  stimuli  to  be  rated,  (4) 
the other (nonbaseline) stimuli should be randomly assigned
to the different sessions,^4 and (5) all other aspects
of  the  sessions  (e.g.,  time  of  day,  experimenter)  should 
remain  constant. 

The  effectiveness  of  a  baseline  is  also  a  function  of  the 
number  of  stimuli  included  in  the  baseline.  The  greater 
the  proportion  of  the  stimuli  to  be  rated  that  are  base- 
line stimuli,  the  more  likely  that  the  baseline  will  ade- 
quately pick  up  differences  in  use  of  the  rating  scale 
between  observers  or  observer  groups,  all  else  being 
equal.  Of  course,  one  must  trade  off  effectiveness  of  the 
baseline  with  the  decrease  in  the  number  of  unique 
stimuli  that  can  be  rated  in  each  session  as  the  baseline 
becomes  larger. 

If  proper  experimental  precautions  are  followed,  it  is 
unlikely  that  the  ratings  will  reflect  substantial  percep- 
tual differences  among  the  different  groups/sessions.  In 
this  case,  given  the  model  described  above,  we  would 
assume  that  any  differences  across  sessions  in  baseline 
ratings  were  due  to  differences  in  judgment  criteria,  not 
differences  in  perception,  and  we  would  then  proceed 
to  use  the  baseline  ratings  to  "bridge  the  gap"  between 
the  rating  sessions. 

In  the  following  section,  we  describe  and  compare  11 
methods  for  scaling  rating  data.  Some  of  these  pro- 
cedures attempt  to  compensate  or  adjust  for  the  poten- 
tial problems  described  above,  and  some  utilize  a 
baseline.  We  do  not  attempt  to  determine  the  relative 
merit  of  these  procedures.  Our  purpose  is  to  provide  the 
reader  with  the  means  to  evaluate  the  utility  of  the  vari- 
ous scaling  procedures  for  any  given  application. 


SCALING  PROCEDURES 

Eleven  scaling  procedures  are  described,  from  the 
simple  median  and  mean  to  the  more  complex  Scenic 
Beauty  Estimation  (SBE)  and  least  squares  techniques. 
All  11  procedures  are  provided  by  RMRATE  (Brown  et 
al.  1990).  All  but  one  of  the  scaling  procedures  provide 
a  scale  value  for  each  stimulus,  and  all  procedures 
provide  scale  values  for  groups  of  stimuli.  In  addition, 
some  of  the  procedures  provide  scale  values  for  each 
rating.  The  scaling  options  are  described  below,  along 
with  some  discussion  of  the  relative  advantages  and  dis- 
advantages of  each. 

Differences  among  the  various  scaling  methods  are  il- 
lustrated using  several  sets  of  hypothetical  rating  data. 
Each  set  of  data  represents  ratings  of  the  same  five  stim- 
uli by  different  groups  of  observers.  For  example,  table 
1  presents  the  ratings  of  three  hypothetical  observer 
groups  (A,  B,  and  C)  each  rating  the  same  five  stimuli 

^4 An example of where this guideline was not followed is reported by
Brown  and  Daniel  (1987),  where  mean  scenic  beauty  ratings  for  a  con- 
stant set  of  landscape  scenes  were  significantly  different  depending  upon 
the  relative  scenic  beauty  of  other  scenes  presented  along  with  the  con- 
stant (baseline)  scenes.  In  that  study,  the  experimental  design  was  tailored 
precisely  to  encourage,  not  avoid,  differences  in  rating  criteria  by  differ- 
ent observer  groups. 
(1, 2, 3, 4, and 5). Table 1 provides a comparison of
simple mean ratings and baseline-adjusted mean ratings
as scaling options. Subsequent tables use some of the
same rating sets (observer groups), as well as additional
hypothetical groups, to compare the other scaling options.
Additional comparisons of the scaling procedures
are presented in the appendix.

Median  Rating 

The  scale  value  calculated  using  this  procedure 
represents  the  numerical  rating  that  is  above  the  ratings 
assigned  by  one-half  of  the  observers  and  below  the 
ratings  assigned  by  the  other  half  of  the  observers.  Thus, 
the  median  is  simply  the  midpoint  rating  in  the  set  of 
ordered ratings; e.g., among the ratings 3, 6, and 2, the
median  is  3.  If  there  is  an  even  number  of  observers,  the 
median  is  the  average  of  the  two  midpoint  ratings;  e.g., 
among  the  ratings  2,  4,  5,  and  6,  the  median  is  4.5.  If 
the  ratings  assigned  to  a  stimulus  are  symmetrically 
(e.g.,  normally)  distributed,  the  median  is  equal  to  the 
mean  rating. 
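
The two worked examples above can be checked with Python's standard library. This is a minimal sketch; the variable names are ours, and the third list is a hypothetical symmetric set of ratings added to illustrate the last sentence.

```python
from statistics import mean, median

odd_ratings = [3, 6, 2]        # odd number of observers (from the text)
even_ratings = [2, 4, 5, 6]    # even number of observers (from the text)
symmetric = [1, 2, 3, 4, 5]    # hypothetical symmetric ratings

med_odd = median(odd_ratings)    # midpoint of the ordered ratings 2, 3, 6
med_even = median(even_ratings)  # average of the two midpoint ratings 4 and 5

print(med_odd, med_even)
print(median(symmetric) == mean(symmetric))  # median equals mean when symmetric
```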

An  advantage  of  the  median  is  that  it  does  not  require 
the  assumption  of  equal-interval  ratings.  The  cor- 
responding disadvantage  is  that  it  provides  only  an 
ordinal  (rank-order)  scaling.  In  terms  of  the  psycholog- 
ical model  presented  above,  selecting  the  median  ratings 
as  the  scale  value  restricts  one  to  simple  ordinal  (greater 
than,  less  than)  information  about  the  position  of  stimuli 
on  the  underlying  psychological  dimension  (e.g.,  per- 
ceived beauty). 


Mean  Rating 

In  many  applications  researchers  have  used  simple 
average  ratings  as  a  scale  value.  The  mean  rating  for  a 
stimulus  is  computed  as: 


MR_i = \frac{1}{n} \sum_{j=1}^{n} R_{ij}   [1]

where

MR_i   = mean rating assigned to stimulus i
R_{ij} = rating given to stimulus i by observer j
n      = number of observers.

Table  1  lists  ratings  by  three  hypothetical  observer 
groups  that  each  rated  5  stimuli.  The  mean  rating  for 
each  stimulus  within  each  data  set  is  also  listed. 
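
As a sketch of equation [1], the following Python fragment computes mean ratings for the three observers of group A in table 1. The function name mean_ratings and the list layout (one list of ratings per observer) are illustrative assumptions.

```python
# ratings[j][i] = rating given to stimulus i by observer j (group A, table 1)
group_a = [
    [1, 2, 3, 4, 5],    # observer 1
    [3, 4, 5, 6, 7],    # observer 2
    [6, 7, 8, 9, 10],   # observer 3
]

def mean_ratings(ratings):
    """Equation [1]: average each stimulus's ratings over the n observers."""
    n = len(ratings)
    n_stimuli = len(ratings[0])
    return [sum(obs[i] for obs in ratings) / n for i in range(n_stimuli)]

mr = mean_ratings(group_a)
print([round(v, 2) for v in mr])
```

Rounded to two decimals, the result matches the mean ratings listed for group A (3.33, 4.33, 5.33, 6.33, and 7.33).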

Ratings,  and  mean  ratings,  do  provide  some  indica- 
tion of  the  magnitude  of  differences  between  objects, 
representing  an  improvement  over  ranks  in  the  direc- 
tion of  an  interval  measure.  However,  simply  averaging 
rating  scale  responses  is  potentially  hazardous,  as  it 
requires  the  assumption  that  the  intervals  between  points 
on  the  rating  scale  are  equal.  Some  statisticians  are  very 
reluctant  to  allow  this  assumption,  and  reject  the  use  of 
average  ratings  as  a  valid  measure  of  differences  in 
the  underlying  property  of  the  objects  being  measured. 
Other  statisticians  are  more  willing  to  allow  the  use  of 
mean  ratings,  at  least  under  specified  conditions.  The 
results  of  computer  modeling  studies  support  the  latter 
position.  These  studies  have  shown  that  when  ratings 
are  averaged  over  reasonable  numbers  of  observers 
(generally  from  about  15  to  30)  who  rate  the  same  set 
of  objects,  the  resulting  scale  values  are  very  robust  to 
a  wide  range  of  interval  configurations  in  the  individ- 
ual rating  scales  (see  citations  in  the  Psychological 
Scaling  section,  above,  plus  numerous  papers  in  Kirk 
(1972)). 

To  compare  mean  ratings  of  stimuli  judged  during  a 
given  session,  one  must  assume  that  on  average  the 
rating  criterion  scale  is  equal  interval.  A  group's  rating 
criterion  scale  is  equal  interval  "on  average"  (1)  if  each 
observer  used  an  equal-interval  rating  criterion  scale,  or 
(2)  if  the  deviations  from  equal  intervals  employed  by 
specific observers are randomly distributed among observers
(there are no consistent deviations, such as all
or most observers compressing the end-points of the
scale). The assumption of equal-interval criterion scales
is probably never strictly met for individual observers,
but for sufficiently large groups of observers (15 to 30
or more, depending on variability within the group) it
may not be unreasonable to assume that "on average"
the intervals between categories are approximately
equal.

Table 1. Ratings and origin-adjusted ratings (OARs) for three observer groups.

                         Rating               OAR            Scale value
Observer              Observer             Observer         Mean     Mean
group     Stimulus    1    2    3       1     2     3       rating   OAR

A         1           1    3    6     -2.0  -2.0  -2.0      3.33    -2.00
          2           2    4    7     -1.0  -1.0  -1.0      4.33    -1.00
          3           3    5    8       .0    .0    .0      5.33      .00
          4           4    6    9      1.0   1.0   1.0      6.33     1.00
          5           5    7   10      2.0   2.0   2.0      7.33     2.00

B         1           1    2    1     -4.0  -4.0  -4.0      1.33    -4.00
          2           3    4    3     -2.0  -2.0  -2.0      3.33    -2.00
          3           5    6    5       .0    .0    .0      5.33      .00
          4           7    8    7      2.0   2.0   2.0      7.33     2.00
          5           9   10    9      4.0   4.0   4.0      9.33     4.00

C         1           1    2    2     -4.0  -4.0  -4.0      1.67    -4.00
          2           3    4    4     -2.0  -2.0  -2.0      3.67    -2.00
          3           5    6    6       .0    .0    .0      5.67      .00
          4           7    8    8      2.0   2.0   2.0      7.67     2.00
          5           9   10   10      4.0   4.0   4.0      9.67     4.00

The  experimenter  must  decide  whether  and  when  it 
is  appropriate  to  use  mean  ratings  as  an  index  of  prefer- 
ence, quality,  or  whatever  property  is  being  measured. 
In  typical  applications  with  multiple  observers  and  a 
proper  experimental  design,  however,  we  have  rarely 
found  situations  in  which  the  results  of  using  mean 
ratings,  as  compared  to  more  sophisticated  scaling 
methods,  produced  substantive  differences  in  conclu- 
sions, statistical  or  scientific,  regarding  relative  prefer- 
ences or  perceived  quality  (see  also  Schroeder  (1984)). 
However,  use  of  mean  ratings  as  interval  scale  data  must 
be  approached  with  considerable  caution.  In  the  final 
analysis,  differences  between  mean  ratings  will  assured- 
ly indicate  commensurate  differences  on  the  underlying 
psychological  dimension  only  if  the  rating  criterion 
scales  of  relevant  observers  or  groups  are  equal-interval. 

Origin-Adjusted  Rating 

This  procedure  applies  an  origin  adjustment  to  each 
observer's  ratings  prior  to  aggregating  over  observers  to 
obtain  a  group  index  for  a  stimulus.  First,  individual  ob- 
server's ratings  are  transformed  to  origin-adjusted  ratings 
(OARs)  by  subtracting  each  observer's  mean  rating  from 
each  of  his  or  her  ratings  as  follows: 

OAR_{ij} = R_{ij} - MR_j   [2]

where

OAR_{ij} = origin-adjusted rating of stimulus i by observer j
R_{ij}   = rating assigned to stimulus i by observer j
MR_j     = mean rating assigned to all stimuli by observer j.

Then the OAR_{ij} are averaged across observers in a
group,  in  a  similar  fashion  to  the  averaging  of  raw  ratings 
in  equation  [1],  to  give  one  scale  value  for  each  stimulus. 

OARs  of  three  hypothetical  observer  groups  are  listed 
in  table  1.  The  ratings  of  the  three  observers  of  group 
A  have  the  same  interval  size  (the  difference  in  ratings 
between  any  two  stimuli  is  the  same  for  all  observers) 
but  different  origins  (the  mean  ratings  of  the  observers 
differ).  Thus,  when  the  mean  rating  of  each  observer  is 
subtracted  from  each  of  the  observer's  ratings,  the  result- 
ing OARs  of  all  three  observers  are  identical  for  any 
given  stimulus.  That  is,  the  adjustment  has  removed  the 
origin  differences  among  observers  to  reveal,  assuming 
common  perception,  that  the  observers  do  not  differ  in 
how  they  distinguish  the  relative  differences  among 
stimuli.  Similarly,  the  OARs  of  observers  in  groups  B 
and  C  are  identical,  and  the  mean  OARs  of  the  two  sets 
are  identical. 
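
The origin adjustment of equation [2] can be sketched as follows, using the ratings of groups A and B from table 1. The function names are hypothetical and the code is an illustration, not the RMRATE implementation.

```python
def origin_adjusted(ratings):
    """Equation [2]: subtract each observer's mean rating from his or her ratings."""
    return [[r - sum(obs) / len(obs) for r in obs] for obs in ratings]

def mean_over_observers(values):
    """Average per-observer values to give one scale value per stimulus."""
    n = len(values)
    return [sum(obs[i] for obs in values) / n for i in range(len(values[0]))]

# One list of ratings per observer, stimuli 1 to 5 (table 1)
group_a = [[1, 2, 3, 4, 5], [3, 4, 5, 6, 7], [6, 7, 8, 9, 10]]
group_b = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]]

oar_a = origin_adjusted(group_a)
oar_b = origin_adjusted(group_b)

# Within each group the three observers' OARs coincide after adjustment
print(oar_a[0] == oar_a[1] == oar_a[2])
print(mean_over_observers(oar_a))
print(mean_over_observers(oar_b))
```

The mean OARs for group A run from -2 to 2, while those for group B run from -4 to 4: the origin differences are gone, but the interval size difference between the groups remains.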


Baseline-Adjusted  OAR 

When  scale  values  are  needed  for  large  numbers  of 
stimuli,  requiring  two  or  more  separate  rating  groups  or 
sessions,  use  of  a  common  set  of  stimuli,  a  baseline  as 
described  above,  is  recommended.  In  such  situations, 
a  variation  of  the  OAR  technique  may  be  applied,  where- 
by the  origin  adjustment  is  accomplished  by  subtract- 
ing the  mean  of  the  baseline  stimuli  (rather  than  the 
mean  of  all  stimuli)  from  each  rating.  This  baseline- 
adjusted  OAR  is  computed  by: 

BOAR_{ij} = R_{ij} - BMR_j   [3]

where

BOAR_{ij} = baseline-adjusted OAR of stimulus i by observer j
R_{ij}    = rating assigned to stimulus i by observer j
BMR_j     = mean rating assigned to baseline stimuli by observer j.

The BOAR_{ij} are then averaged across observers in a
group  or  session  to  yield  one  scale  value  for  each  stimu- 
lus. Of  course,  the  cautions  regarding  the  proper  design 
of  the  baseline  "bridges"  between  different  rating 
groups/sessions  should  be  carefully  considered. 
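
A minimal sketch of equation [3] follows. All data here are hypothetical: the stimulus labels ("b1" to "b3" for baseline stimuli, "u1" and "u2" for stimuli unique to the session), the ratings, and the function names are assumptions made for illustration only.

```python
def baseline_adjusted_oar(ratings, baseline):
    """Equation [3]: subtract each observer's mean baseline rating (BMR_j)."""
    adjusted = []
    for obs in ratings:  # obs maps stimulus label -> rating for one observer
        bmr = sum(obs[s] for s in baseline) / len(baseline)
        adjusted.append({s: r - bmr for s, r in obs.items()})
    return adjusted

def scale_values(adjusted):
    """Average the adjusted ratings across observers, one value per stimulus."""
    n = len(adjusted)
    return {s: sum(obs[s] for obs in adjusted) / n for s in adjusted[0]}

baseline = ["b1", "b2", "b3"]  # hypothetical baseline stimuli
session = [                    # hypothetical ratings by two observers
    {"b1": 2, "b2": 5, "b3": 8, "u1": 4, "u2": 7},
    {"b1": 4, "b2": 7, "b3": 10, "u1": 6, "u2": 9},
]

boar = scale_values(baseline_adjusted_oar(session, baseline))
print(boar)
```

In this example the second observer rates every stimulus two points higher than the first; after subtracting each observer's baseline mean, both observers yield the same adjusted values.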

The  origin-adjustment  corrects  for  the  effects  of  differ- 
ences in  the  origin  of  observers'  rating  criterion  scales, 
but  not  for  the  effects  of  differences  in  interval  size,  as 
seen  by  comparing  ratings  of  group  A  with  those  of 
groups  B  and  C  in  table  1.  Mean  OARs  are  identical  for 
groups  B  and  C,  which  each  used  an  interval  of  two 
rating  points  for  distinguishing  between  proximate 
stimuli.  Group  A,  however,  exhibits  an  interval  size  of 
only  1,  resulting  in  mean  OARs  that  differ  from  those 
of  the  other  two  groups.  A  more  sophisticated  stand- 
ardized score,  such  as  the  Z-score  presented  next,  ad- 
justs for  both  origin  and  interval  differences  and,  thus, 
is  preferable  to  a  simple  origin  adjustment.  However,  the 
origin-adjusted  rating  is  included  here  to  facilitate  the 
transition  from  simple  mean  ratings  to  more  sophisti- 
cated standardized  scores.  If  an  investigator  is  willing 
to  assume  that  observers/groups  differ  only  in  the  origin 
of  their  rating  criteria,  then  origin-adjusted  ratings  could 
be  taken  as  indicators  of  stimulus  locations  on  the  un- 
derlying (hypothetical)  scale. 

Z-Score 

This  procedure  employs  a  Z-score  transformation  of 
individual  observer's  ratings  prior  to  aggregating  over 
observers  to  obtain  a  group  index  for  a  stimulus.  First, 
individual  observer's  ratings  are  transformed  to  standard 
scores  using  the  conventional  formula: 

Z_{ij} = (R_{ij} - MR_j) / SDR_j   [4]

where

Z_{ij} = Z-score for stimulus i by observer j
R_{ij} = rating assigned to stimulus i by observer j
MR_j   = mean rating assigned to all stimuli by observer j
SDR_j  = standard deviation of ratings assigned by observer j
n      = number of observers.

Then the Z_{ij} are averaged across observers in the group
to  give  one  scale  value  for  each  stimulus. 

Z-scores  have  several  important  characteristics.  For 
each  individual  observer,  the  mean  of  the  Z-scores  over 
the  stimuli  rated  will  always  be  zero.  Also,  the  standard 
deviation  of  the  Z-scores  for  each  observer  will  always 
be  1.0.  Thus,  the  initial  ratings  assigned  by  an  observer, 
which  may  be  affected  by  individual  tendencies  in  use 
of  the  rating  scale,  are  transformed  to  a  common  scale 
that  can  be  directly  compared  between  (and  combined 
over)  observers.  Note  that  this  procedure  allows  direct 
comparison  even  if  different  observers  used  explicitly 
different  rating  scales,  such  as  a  6-point  scale  versus  a 
10-point  scale. 
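
Equation [4] can be sketched in Python as shown below, using the ratings of group A. One assumption should be noted: the standard deviation is computed with the n - 1 (sample) denominator, as statistics.stdev does, because that choice reproduces the Z-scores listed in table 2. The function names are hypothetical.

```python
from statistics import mean, stdev

def z_scores(obs):
    """Equation [4] for one observer: (rating - observer mean) / observer SD."""
    m, sd = mean(obs), stdev(obs)  # stdev uses the n - 1 denominator
    return [(r - m) / sd for r in obs]

def mean_z(ratings):
    """Average the Z-scores across observers, one scale value per stimulus."""
    z = [z_scores(obs) for obs in ratings]
    return [mean(col) for col in zip(*z)]

# One list of ratings per observer, stimuli 1 to 5 (group A)
group_a = [[1, 2, 3, 4, 5], [3, 4, 5, 6, 7], [6, 7, 8, 9, 10]]

z_a = mean_z(group_a)
print([round(v, 2) for v in z_a])
```

Each observer's Z-scores have a mean of zero and a standard deviation of 1.0, so the three observers of group A, who differ only in origin, produce identical Z-scores.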

When  Z-scores  are  computed  for  individual  observers 
by  [4],  the  mean  and  standard  deviation  of  the  resulting 
scale  will  be  changed  to  0  and  1.0,  respectively.  The 
shape  of  the  resulting  Z-score  distribution,  however,  will 
be  the  same  as  that  of  the  original  rating  distribution, 
because  only  a  linear  transformation  of  the  ratings  has 
been  applied  (e.g.,  it  will  not  be  forced  into  a  normal 
distribution). However, the subsequent procedure of
averaging individual observer Z-scores to obtain aggregate
(group) indices for stimuli makes individual
departures from normality relatively inconsequential.^5

The transformation effected by the Z-score computation
removes linear differences among observers' ratings.

^5 The basis for this claim is the same as that which supports the
application of normal distribution ("parametric") statistics to data that are
not normally distributed.


All  differences  among  observers'  ratings  that  result  from 
criterion  scale  differences  will  be  linear  if  the  observers 
employed  equal-interval  criterion  scales.  Thus,  to  the  ex- 
tent that  observers'  criterion  scales  were  equal-interval, 
arbitrary  differences  between  observers  in  how  they  use 
the  rating  scale  are  removed  with  the  Z  transformation. 
These  differences  include  both  the  tendency  to  use  the 
high  or  low  end  of  the  scale  (origin  differences)  and 
differences  in  the  extent  or  range  of  the  scale  used  (in- 
terval size  differences),  as  illustrated  in  figure  2.  If  the 
equal-interval  scale  assumption  is  satisfied,  scaling 
ratings  by  the  Z  transformation  allows  any  differences 
among  the  observers'  Z-scores  to  reflect  differences  in 
the  perceived  values  of  the  stimuli. 

Hypothetical  ratings  and  corresponding  Z-scores  are 
listed  in  table  2  for  four  observer  groups.  Three  results 
of  the  Z-score  transformation  can  be  seen  in  table  2.  First, 
the  Z-score  transformation  adjusts  for  origin  differences, 
as  can  be  seen  by  comparing  ratings  and  Z-scores  among 
observers  of  group  A,  or  among  observers  of  group  B. 
Second,  the  transformation  adjusts  for  interval  size 
differences,  as  can  be  seen  by  comparing  ratings  and  Z- 
scores  of  observer  2  of  group  A  with  those  of  observer 
1  of  group  B.  The  combined  effect  of  these  two  adjust- 
ments is  seen  by  examining  group  E,  which  includes  a 
mixture  of  ratings  from  groups  A  and  B.  Finally,  it  is 
seen  by  comparing  groups  B  and  D  that  sets  of  ratings 
that  produce  identical  mean  ratings  do  not  necessarily 
produce  identical  mean  Z-scores.  Two  sets  of  ratings  will 
necessarily  produce  identical  mean  Z-scores  only  if  the 
sets  of  ratings  are  perfectly  correlated  (if  the  ratings  of 
each  observer  of  one  set  are  linearly  related  to  all  other 
observers  of  that  set  and  to  all  observers  of  the  other  set). 
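
The last point can be checked numerically. The sketch below uses two sets of ratings from table 2 that share the same mean ratings; we take them to be the groups the text calls B and D, which is an assumption about the table's group labels. The function name scale is hypothetical, and the sample (n - 1) standard deviation is assumed.

```python
from statistics import mean, stdev

def scale(ratings):
    """Return (mean ratings, mean Z-scores), one value per stimulus."""
    mean_rating = [mean(col) for col in zip(*ratings)]
    z = [[(r - mean(obs)) / stdev(obs) for r in obs] for obs in ratings]
    mean_z = [mean(col) for col in zip(*z)]
    return mean_rating, mean_z

# One list of ratings per observer, stimuli 1 to 5
group_b = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]]
group_d = [[1, 2, 3, 5, 9], [2, 6, 7, 8, 9], [1, 2, 6, 9, 10]]

mr_b, z_b = scale(group_b)
mr_d, z_d = scale(group_d)

print(mr_b == mr_d)                 # the mean ratings coincide
print([round(v, 2) for v in z_b])
print([round(v, 2) for v in z_d])   # the mean Z-scores do not
```

The observers of the first set are linearly related to one another, so their Z-scores are identical; the observers of the second set are not, and their mean Z-scores depart from those of the first set even though the mean ratings agree.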


Table 2. Ratings and Z-scores for four observer groups.

                         Rating              Z-score            Scale value
Observer              Observer              Observer           Mean     Mean
group     Stimulus    1    2    3       1      2      3        rating   Z-score

A         1           1    3    6     -1.26  -1.26  -1.26      3.33    -1.26
          2           2    4    7      -.63   -.63   -.63      4.33     -.63
          3           3    5    8       .00    .00    .00      5.33      .00
          4           4    6    9       .63    .63    .63      6.33      .63
          5           5    7   10      1.26   1.26   1.26      7.33     1.26

B         1           1    2    1     -1.26  -1.26  -1.26      1.33    -1.26
          2           3    4    3      -.63   -.63   -.63      3.33     -.63
          3           5    6    5       .00    .00    .00      5.33      .00
          4           7    8    7       .63    .63    .63      7.33      .63
          5           9   10    9      1.26   1.26   1.26      9.33     1.26

D         1           1    2    1      -.95  -1.63  -1.14      1.33    -1.24
          2           2    6    2      -.63   -.15   -.89      3.33     -.56
          3           3    7    6      -.32    .22    .10      5.33      .00
          4           5    8    9       .32    .59    .84      7.33      .58
          5           9    9   10      1.58    .96   1.09      9.33     1.21

E         1           1    6    1     -1.26  -1.26  -1.26      2.67    -1.26
          2           2    7    3      -.63   -.63   -.63      4.00     -.63
          3           3    8    5       .00    .00    .00      5.33      .00
          4           4    9    7       .63    .63    .63      6.67      .63
          5           5   10    9      1.26   1.26   1.26      8.00     1.26
Baseline-Adjusted Z-Score

When different observers have rated sets of stimuli that
only  partially  overlap,  and  their  scores  are  to  be  com- 
pared, baseline  stimuli  can  provide  a  common  basis  for 
transforming  individual  observer's  ratings  into  a  stand- 
ardized scale.  Using  ratings  of  the  baseline  stimuli  as 
the  basis  of  the  standardization,  the  baseline-adjusted 
Z-score  procedure  computes  standard  scores  as: 

BZ_{ij} = (R_{ij} - BMR_j) / BSDR_j   [5]

where

BZ_{ij} = baseline-adjusted standard score of stimulus i for observer j
R_{ij}  = rating of stimulus i by observer j
BMR_j   = mean rating of the baseline stimuli by observer j
BSDR_j  = standard deviation of ratings of the baseline stimuli by observer j.

The BZ_{ij} are then averaged across observers to yield one
scale value per stimulus (BZ_i).

All  ratings  assigned  by  an  observer  are  transformed 
by  adjusting  the  origin  and  interval  to  the  mean  and 
standard  deviation  of  that  observer's  ratings  of  the  base- 
line stimuli.  BZ,  then,  is  a  standardized  score  based  only 
on  the  stimuli  that  were  rated  in  common  by  all  observ- 
ers in  a  given  assessment.  While  the  standardization 
parameters  (mean  and  standard  deviation)  are  derived 
only  from  the  baseline  stimuli,  they  are  applied  to  all 
stimuli  rated  by  the  observer.  Thus,  as  stated  above,  it 
is  important  that  the  baseline  stimuli  be  reasonably 
representative  of  the  total  assessment  set,  and  that  the 
additional  "nonbaseline"  stimuli  rated  by  the  separate 
groups  (sessions)  are  comparable. 
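
The following sketch illustrates equation [5] for two observers from different sessions. Everything in it is hypothetical: the stimulus labels ("b1" to "b3" for baseline stimuli, "u1" and "u2" for stimuli unique to each session), the ratings, and the function name. The sample (n - 1) standard deviation is assumed.

```python
from statistics import mean, stdev

def baseline_z(obs, baseline):
    """Equation [5]: standardize all of one observer's ratings using the
    mean (BMR_j) and standard deviation (BSDR_j) of the baseline ratings."""
    base = [obs[s] for s in baseline]
    bmr, bsdr = mean(base), stdev(base)
    return {s: (r - bmr) / bsdr for s, r in obs.items()}

baseline = ["b1", "b2", "b3"]                        # rated in both sessions
observer_1 = {"b1": 2, "b2": 5, "b3": 8, "u1": 4}    # session 1
observer_2 = {"b1": 1, "b2": 3, "b3": 5, "u2": 4}    # session 2

bz_1 = baseline_z(observer_1, baseline)
bz_2 = baseline_z(observer_2, baseline)

print(bz_1)
print(bz_2)
```

Although the two observers differ in both origin and interval size, their baseline stimuli receive identical standardized scores, and the unique stimuli "u1" and "u2" are placed on the same scale. Whether that placement is meaningful depends on the baseline being representative, as cautioned above.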

Given  the  assumptions  described  above,  the 
computed-Z  procedures  transform  each  observer's 
ratings  to  a  scale  that  is  directly  comparable  to  (and  can 
be  combined  with)  the  scale  values  of  other  observers. 
This  is  accomplished  by  individually  setting  the  origin 
of  each  observer's  scale  to  the  mean  of  the  ratings  that 
observer  assigned  to  all  of  the  stimuli  (or  the  baseline 
stimuli).  The  interval,  by  which  differences  between 
stimuli  are  gauged,  is  also  adjusted  to  be  the  standard 
deviation  of  the  observer's  ratings  of  all  (or  the  baseline) 
stimuli.  The  appropriate  scale  value  for  each  stimulus 
is the mean Z over all observers.^6

The  Z  transformation  is  accomplished  individually  for 
each  observer,  without  reference  to  the  ratings  assigned 
by  other  observers.  An  alternative  procedure  is  to  select 
origin  and  interval  parameters  for  each  observer's  scale 
so  that  the  best  fit  is  achieved  with  the  ratings  assigned 
by  all  of  the  observers  that  have  rated  the  same  stimuli. 

^6 Both origin and interval are arbitrary for interval scale measures. The
origin for the mean Z-score scale (the zero point) will be the grand mean
for all stimuli (or all baseline stimuli), and the interval size for the scale
will be 1.0 divided by the square root of the number of observers.
Because the interval size depends on the number of observers, one must
be careful in making absolute comparisons between mean Zs based on
different sized observer groups. This would not, however, affect relative
comparisons (e.g., correlations) between groups.


This  "best  fit"  is  achieved  by  the  least  squares  procedure 
described  next. 

Least  Squares  Rating 

This  procedure  is  based  on  a  least  squares  analysis  that 
individually  "fits"  each  observer's  ratings  to  the  mean 
ratings  of  the  entire  group  of  observers.  There  are  two 
variants  of  the  procedure,  depending  upon  whether 
ratings  of  all  stimuli,  or  only  the  baseline  stimuli,  are 
used  to  standardize  or  fit  the  individual  observer's 
ratings. 

Part  of  the  rationale  for  transforming  observers'  ratings 
to  some  other  scale  is  that  the  ratings  do  not  directly 
reflect  the  associated  values  on  the  assumed  psycho- 
logical dimension  that  is  being  measured.  The  need  for 
transformation  is  most  obvious  when  different  observ- 
ers rate  the  same  objects  using  explicitly  different  rating 
scales;  unstandardized  ratings  from  a  5-point  scale  can- 
not be  directly  compared  or  combined  with  ratings  from 
a  10-point  scale,  and  neither  can  be  assumed  to  directly 
reflect  either  the  locations  of,  or  distances  between,  ob- 
jects on  the  implicit  psychological  scale.  Similarly,  even 
when  the  same  explicit  rating  scale  is  used  to  indicate 
values  on  the  psychological  dimension,  there  is  no 
guarantee  that  every  observer  will  use  that  scale  in  the 
same  way  (i.e.,  will  use  identical  rating  criteria). 

The  goal  of  psychological  scaling  procedures  is  to 
transform  the  overt  indicator  responses  (ratings)  into  a 
common  scale  that  accurately  represents  the  distribution 
of  values  on  the  psychological  dimension  that  is  the 
target of the measurement effort. The Z-score procedure
approaches this measurement problem by individually
transforming each observer's ratings to achieve a
standardized measure for each stimulus. Each observer's
ratings are scaled independently (only with respect
to that particular observer's rating distribution) and then
averaged to produce a group index for each stimulus.
The least squares procedure, like the Z-score procedure,
derives a scale value for each observer for each stimulus.
Each observer's actual ratings, however, are
used to estimate ("predict") scores for each stimulus
based on the linear fit with the distribution of ratings
assigned by the entire group of observers that rated the
same stimuli. This estimated score is produced by
regressing the group mean ratings for the stimuli ($MR_i$)
on the individual stimulus ratings assigned by each
observer ($R_{ij}$). The resulting regression coefficients are
then used to produce the estimated ratings:

$$LSR_{ij} = a_j + b_j R_{ij}$$    [6]

where

$LSR_{ij}$ = least squares rating for stimulus i of observer j
$R_{ij}$ = raw rating for stimulus i assigned by observer j
$a_j$ = intercept of the regression line for observer j
$b_j$ = slope of the regression line for observer j.

This is done for each observer, so that an $LSR_{ij}$ is
estimated for each $R_{ij}$.
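The regression in [6] can be sketched as below. This is a minimal illustration rather than the RMRATE code: the function name `least_squares_ratings` is hypothetical, the ordinary least squares slope and intercept are computed by hand, and the data are the ratings of observer group G in table 3. Under these assumptions the output should agree with the LSR and mean LSR columns of table 3 to two decimals.

```python
from statistics import mean

def least_squares_ratings(ratings_by_observer):
    """Regress the group mean ratings (MR_i) on each observer's ratings
    (R_ij) and return the fitted values LSR_ij = a_j + b_j * R_ij."""
    group_means = [mean(col) for col in zip(*ratings_by_observer)]
    grand_mean = mean(group_means)
    lsr = []
    for obs in ratings_by_observer:
        obs_mean = mean(obs)
        sxx = sum((r - obs_mean) ** 2 for r in obs)
        sxy = sum((r - obs_mean) * (g - grand_mean)
                  for r, g in zip(obs, group_means))
        b_j = sxy / sxx                     # slope for observer j
        a_j = grand_mean - b_j * obs_mean   # intercept for observer j
        lsr.append([a_j + b_j * r for r in obs])
    return lsr

# Ratings of observer group G in table 3 (rows = observers 1 to 3).
group_g = [
    [1, 2, 3, 4, 5],
    [3, 4, 5, 6, 7],
    [3, 4, 3, 2, 1],
]
lsr = least_squares_ratings(group_g)
mean_lsr = [mean(col) for col in zip(*lsr)]
print([round(v, 2) for v in lsr[0]])
print([round(v, 2) for v in mean_lsr])
```

Note that the slope is undefined for an observer who assigns the same rating to every stimulus (the denominator `sxx` is zero); the sketch does not guard against that case.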

Table 3 lists ratings and associated least squares scores
for six observer groups. The table shows that if the
ratings of two observers in a given group correlate
perfectly, they will yield identical LSRs. For example, the
ratings by all observers of group A are perfectly
correlated and, thus, all observers have identical LSRs. The
same is true for observers in groups B and E, and for
observers 1 and 2 of group F. However, unlike the Z-score
procedure, observers of two different data sets will not
necessarily yield identical LSRs, even though their
ratings are perfectly correlated or even identical
(compare LSRs of observer 1 of groups A, E, and G).

Table 3.--Ratings and least squares ratings for six observer groups.

Observer               Rating              LSR               Scale value
group     Stimulus    1    2    3       1      2      3     Mean    Mean
                                                          rating     LSR

A         1           1    3    6    3.33   3.33   3.33     3.33    3.33
          2           2    4    7    4.33   4.33   4.33     4.33    4.33
          3           3    5    8    5.33   5.33   5.33     5.33    5.33
          4           4    6    9    6.33   6.33   6.33     6.33    6.33
          5           5    7   10    7.33   7.33   7.33     7.33    7.33

B         1           1    2    1    1.33   1.33   1.33     1.33    1.33
          2           3    4    3    3.33   3.33   3.33     3.33    3.33
          3           5    6    5    5.33   5.33   5.33     5.33    5.33
          4           7    8    7    7.33   7.33   7.33     7.33    7.33
          5           9   10    9    9.33   9.33   9.33     9.33    9.33

D         1           1    2    1    2.48   0.51   1.81     1.33    1.60
          2           2    6    2    3.43   4.89   2.57     3.33    3.63
          3           3    7    6    4.38   5.99   5.64     5.33    5.34
          4           5    8    9    6.28   7.09   7.94     7.33    7.10
          5           9    9   10   10.08   8.18   8.71     9.33    8.99

E         1           1    6    1    2.67   2.67   2.67     2.67    2.67
          2           2    7    3    4.00   4.00   4.00     4.00    4.00
          3           3    8    5    5.33   5.33   5.33     5.33    5.33
          4           4    9    7    6.67   6.67   6.67     6.67    6.67
          5           5   10    9    8.00   8.00   8.00     8.00    8.00

F         1           1    2    1    1.07   1.07   2.10     1.33    1.41
          2           3    4    2    3.03   3.03   3.07     3.00    3.04
          3           5    6    3    5.00   5.00   4.03     4.67    4.68
          4           7    8    5    6.97   6.97   5.97     6.67    6.63
          5           9   10    9    8.93   8.93   9.83     9.33    9.23

G         1           1    3    3    2.60   2.60   3.36     2.33    2.85
          2           2    4    4    3.07   3.07   2.92     3.33    3.02
          3           3    5    3    3.53   3.53   3.36     3.67    3.48
          4           4    6    2    4.00   4.00   3.79     4.00    3.93
          5           5    7    1    4.47   4.47   4.23     4.33    4.39

Table  3  also  shows  that  if  ratings  of  all  observers 
within  a  group  are  perfectly  correlated  with  each  other, 
as  in  groups  A,  B,  and  E,  the  group  mean  LSRs  for  the 
stimuli  will  be  identical  to  the  group's  mean  ratings. 
However,  if  ratings  of  one  or  more  observers  in  the  set 
are  not  perfectly  correlated  with  those  of  other  observers, 
the mean LSRs will not (except by chance) be identical
to  the  mean  ratings,  as  in  group  F.  Finally,  it  can  be  seen, 
by  comparing  groups  B  and  D,  that  identical  mean 
ratings  will  not  necessarily  produce  identical  mean 
LSRs. 

The LSR transformation reflects an assumption of the
general psychometric model that consistent differences
between observers (over a constant set of stimuli) are due
to differences in rating criteria, and that consistent
differences between stimuli (over a set of observers) indicate
differences on the underlying psychological dimension.
In the LSR procedure, each observer's ratings are
weighted by the correlation with the group means. The
group means are taken to be the best estimate of the
"true" values for the stimuli on the underlying
perceptual dimension.

Equation [6] can be restated to better reveal how each
observer's estimated ratings are derived from
the mean ratings of all observers:

$$LSR_{ij} = MMR + r_{jn} (SDMR/SDR_j) (R_{ij} - MR_j)$$    [7]

where

$LSR_{ij}$ = transformed rating scale value for stimulus i
           for observer j (as above)
MMR = mean of the mean ratings assigned to all
      stimuli by all observers in the group (the
      grand mean)
$r_{jn}$ = correlation between observer j's ratings and
         the mean ratings assigned by all (n) observers
         in the group
SDMR = standard deviation of the mean ratings
       assigned by all observers in the group
$SDR_j$ = standard deviation of observer j's ratings
$R_{ij}$ = rating assigned to stimulus i by observer j
$MR_j$ = mean of all ratings by observer j.

As examination of [7] shows, the resulting LSR values
for every observer will have a mean (over all stimuli)
equal to the grand mean (MMR). The standard deviation
of the transformed scale depends upon the correlation
between the individual and group mean ratings and on
the ratio of the individual and group standard
deviations. As in all regression procedures, the standard
deviation of the estimated values will be less than or
equal to that of the variable being predicted (here, the
group mean ratings).
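Both properties can be checked directly from [7]. In the following worked step, m (the number of stimuli) and SD(·) (the standard deviation taken over the stimuli) are notation introduced here for convenience and are not symbols used elsewhere in this paper.

```latex
\begin{align*}
\frac{1}{m}\sum_{i=1}^{m} LSR_{ij}
  &= MMR + r_{jn}\,\frac{SDMR}{SDR_j}\cdot\frac{1}{m}\sum_{i=1}^{m}\left(R_{ij} - MR_j\right)
   = MMR, \\
SD\!\left(LSR_{j}\right)
  &= |r_{jn}|\,\frac{SDMR}{SDR_j}\,SD\!\left(R_{j}\right)
   = |r_{jn}|\,SDMR \;\le\; SDMR .
\end{align*}
```

The first line uses the fact that deviations from an observer's own mean sum to zero; the second uses $SD(R_j) = SDR_j$, so that the observer's own standard deviation cancels.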

The variation in each individual observer's least
squares scale ($LSR_{ij}$) about the group's grand mean
rating (MMR) depends largely on how well the observer
agreed with the group of observers ($r_{jn}$). The greater the
absolute value of the correlation, the greater the
variation in the observer's LSRs will be. If $r_{jn} = 0$, for
example, observer j will contribute nothing toward
distinctions among the stimuli. In effect, the least squares
procedure weights the contribution of each observer to
the group scale values by the observer's correspondence
with the group. Thus, in table 3, observers of groups A
and B contribute equally to the scale values of their
respective data sets, but observers of group D do not. Of
particular interest is observer group F. The raw ratings
of all three observers have the same range (8) and
standard deviation ($SDR_j = 3.16$), but the correlation of an
observer's ratings with the group mean ratings ($r_{jn}$) is
slightly larger for observers 1 and 2 (0.995) than it is for
observer 3 (0.977). This difference in correlations causes
the range and standard deviation of observer 3's LSRs
to be smaller than those of observers 1 and 2.

The ratio of standard deviations in [7] (SDMR/$SDR_j$)
acts to adjust for differences among observers in the
variety (e.g., range) of rating values used over the set
of stimuli. Observers who use a relatively large range
of the rating scale, and therefore generate relatively large
differences between individual ratings and their mean
ratings ($R_{ij} - MR_j$), will tend to have larger standard
deviations ($SDR_j$) and, therefore, smaller ratios of
standard deviations (SDMR/$SDR_j$), thereby reducing the
variation in the observer's LSRs. Conversely, the variance
of the LSRs of observers who use a relatively small range
of the rating scale will tend to be enhanced by the ratio
of standard deviations in [7]. For an example, consider
observer group E in table 3. The ratings of all three
observers correlate perfectly, so $r_{jn}$ plays no role in
distinguishing among the observers' $LSR_{ij}$. However, the
standard deviation of observer 3's ratings is larger than
that for observers 1 and 2. It is this difference in $SDR_j$
that adjusts for the difference in interval size in the
ratings, causing the three observers' LSRs to be identical.
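The role of the ratio of standard deviations can be illustrated with a short sketch of [7]. The function names (`pearson_r`, `lsr_eq7`) are hypothetical, and the ratings are taken to be those of observer group E in table 3; the sketch is meant only to show the mechanics, not to reproduce RMRATE.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def lsr_eq7(ratings_by_observer):
    """LSR_ij = MMR + r_jn * (SDMR / SDR_j) * (R_ij - MR_j), equation [7]."""
    group_means = [mean(col) for col in zip(*ratings_by_observer)]
    mmr, sdmr = mean(group_means), stdev(group_means)
    out = []
    for obs in ratings_by_observer:
        r_jn = pearson_r(obs, group_means)
        ratio = sdmr / stdev(obs)
        out.append([mmr + r_jn * ratio * (r - mean(obs)) for r in obs])
    return out

# Observer group E of table 3 (rows = observers 1 to 3).
group_e = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [1, 3, 5, 7, 9],
]
group_means = [mean(col) for col in zip(*group_e)]
sd_ratios = [stdev(group_means) / stdev(obs) for obs in group_e]
lsr = lsr_eq7(group_e)
print([round(x, 2) for x in sd_ratios])
for row in lsr:
    print([round(v, 2) for v in row])
```

Observer 3, who uses twice the interval of observers 1 and 2, receives half their ratio, and the three rows of LSRs coincide with the group mean ratings.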

Observer group G of table 3 contains one observer
(number 3) whose ratings correlate negatively (-0.65)
with the group mean ratings. The effect of the least
squares procedure is to produce LSRs for observer 3 that
correlate positively (0.65) with the group mean ratings.
The cause of this sign reversal can be seen in [7], where
the sign of $r_{jn}$ interacts with the sign of ($R_{ij} - MR_j$) to
reverse the direction of the scores of an observer in
serious disagreement with the group (such a person will
have a negative $r_{jn}$ and tend to have a sign for ($R_{ij} - MR_j$)
that is contrary to the sign for observers in
agreement with the group). This reversal is of small
consequence for values of $r_{jn}$ close to 0. But for more
substantial negative values of $r_{jn}$, the reversal is
significant, for it in effect nullifies the influence on the group
metric of an observer who may actually have "opposite"
preferences from the group. If such a reversal is not
desired, the observer's ratings should be removed.
However, a substantial negative correlation with the
group can also arise when the observer has
misinterpreted the direction of the rating scale (e.g., taking "1"
to be "best" and "10" to be "worst," when the
instructions indicated the opposite). If misinterpretation of the
direction of the scale can be confirmed, a
transformation that reverses the observer's scale, such as that
provided by the LSR, would be appropriate.
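The sign reversal can be checked numerically with the following sketch, which uses the ratings of observer group G in table 3. The helper names (`pearson_r`, `fitted_lsr`) are hypothetical. Under these assumptions the two printed correlations should match the values of -0.65 and 0.65 cited above.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fitted_lsr(obs, group_means):
    """Fitted values from regressing the group means on one observer's ratings."""
    mx, my = mean(obs), mean(group_means)
    sxy = sum((r - mx) * (g - my) for r, g in zip(obs, group_means))
    sxx = sum((r - mx) ** 2 for r in obs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * r for r in obs]

# Ratings of observer group G in table 3 (rows = observers 1 to 3).
group_g = [
    [1, 2, 3, 4, 5],
    [3, 4, 5, 6, 7],
    [3, 4, 3, 2, 1],
]
group_means = [mean(col) for col in zip(*group_g)]
obs3 = group_g[2]
lsr3 = fitted_lsr(obs3, group_means)
r_raw = pearson_r(obs3, group_means)   # raw ratings vs. group means
r_lsr = pearson_r(lsr3, group_means)   # LSRs vs. group means
print(round(r_raw, 2), round(r_lsr, 2))
```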


Baseline-Adjusted  LSR 

The  "baseline"  variant  of  the  least  squares  procedure 
is  the  same  as  the  normal  least  squares  procedure 
described  above,  but  the  regression  is  based  only  on  the 
fit  between  the  individual  and  the  group  for  the  base- 
line stimuli.  Note  that  the  baseline-adjusted  LSR  (BLSR) 
procedure  does  not  provide  a  mechanism  for  absolute 
comparisons  of  LSRs  across  observer  groups,  because  the 
procedure  does  not  adjust  for  linear,  or  any  other,  differ- 
ences between  groups;  the  function  of  the  regression 
procedure  is  to  weight  observers'  ratings,  not  assist  com- 
parability across  groups. 

Comparison  of  Z-Scores  and  LSRs 

The  least  squares  procedures  are  related  to  the  Z-score 
procedures.  Both  involve  a  transformation  of  each  in- 
dividual observer's  ratings  to  another  common  measure- 
ment scale  before  individual  indices  are  averaged  to 
obtain  the  group  index,  and  both  rely  on  the  assump- 
tion of  equal  interval  ratings.  The  Z-score  computation 
transforms  each  individual  rating  distribution  to  a  scale 
with  a  mean  of  0  and  a  standard  deviation  of  1.0.  With 
only  a  slight  modification  in  the  transformation  equa- 
tion, the  rating  scales  could  instead  be  transformed  to 
some  other  common  scale,  such  as  a  scale  with  a  mean 
of  100  and  a  standard  deviation  of  10.  In  any  case,  the 
resulting  Z-scores  for  any  individual  observer  are  a  linear 
transformation  of  the  observer's  initial  ratings  and,  there- 
fore, will  correlate  perfectly  with  the  observer's  initial 
ratings. 

The  least  squares  procedure  also  transforms  each  ob- 
server's ratings  to  a  common  scale,  this  time  based  on 
the  group  mean  ratings.  The  mean  of  the  least-squares 
transformed  scale  for  every  individual  observer  is  the 
grand  mean  rating  over  all  observers,  and  the  standard 
deviation  will  depend  upon  the  standard  deviation  of 
the original ratings and on the obtained correlation
between the individual's ratings and the group average
ratings. Like the Z-score procedure, however, an individual
observer's LSRs will correlate perfectly with the
observer's initial ratings.

The relationship between the computed Z-score
approach and the least squares estimation procedure can
be more easily seen by rearranging the terms of the basic
regression equation [7] into:

$$(LSR_{ij} - MMR)/SDMR = r_{jn} (R_{ij} - MR_j)/SDR_j$$    [8]

In this arrangement the left term is recognized as
$Z_{LSR_{ij}}$, the standardized transform of the least squares
estimated ratings of observer j. The right term includes
the correlation between observer j's ratings of the stimuli
and the mean ratings assigned by the group, $r_{jn}$, and the
standardized form of the observer's ratings, $Z_{ij}$ (see [4]).
Note that if $|r_{jn}| = 1.0$ (indicating a perfect linear
relationship between observer j's ratings and the group mean
ratings), $Z_{LSR_{ij}}$ and $Z_{ij}$ are equal. For this to occur, the
observer's ratings and the group mean ratings would
have to differ only by a linear transform; i.e., they would
have to be equal except for their origin and interval size,
which are arbitrary for equal-interval scales. Because
$|r_{jn}|$ is virtually never 1.0, the computed Z-scores ($Z_{ij}$)
will not generally be equal to the $Z_{LSR_{ij}}$, and neither
will be equal to the standardized group means. However,
unless the individual observer correlations with the
group means differ substantially, the distributions of
average scale values, the mean $Z_i$ and the mean $LSR_i$,
will be strongly correlated.
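As a numerical check of [8], the sketch below computes both sides of the equation for every rating of observer group G in table 3 and then compares the two group scales (mean Z and mean LSR). The variable names are hypothetical, and the sample standard deviation is assumed throughout.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Ratings of observer group G in table 3 (rows = observers 1 to 3).
group_g = [
    [1, 2, 3, 4, 5],
    [3, 4, 5, 6, 7],
    [3, 4, 3, 2, 1],
]
group_means = [mean(col) for col in zip(*group_g)]
mmr, sdmr = mean(group_means), stdev(group_means)

z_rows, lsr_rows, max_gap = [], [], 0.0
for obs in group_g:
    mr_j, sdr_j = mean(obs), stdev(obs)
    r_jn = pearson_r(obs, group_means)
    z_j = [(r - mr_j) / sdr_j for r in obs]                          # computed Z
    lsr_j = [mmr + r_jn * (sdmr / sdr_j) * (r - mr_j) for r in obs]  # equation [7]
    for z_ij, lsr_ij in zip(z_j, lsr_j):
        z_lsr = (lsr_ij - mmr) / sdmr                                # left side of [8]
        max_gap = max(max_gap, abs(z_lsr - r_jn * z_ij))
    z_rows.append(z_j)
    lsr_rows.append(lsr_j)

mean_z = [mean(col) for col in zip(*z_rows)]
mean_lsr = [mean(col) for col in zip(*lsr_rows)]
scale_agreement = pearson_r(mean_z, mean_lsr)
print(max_gap)            # numerically zero when [8] holds
print(scale_agreement)    # correlation between mean Z and mean LSR
```

Group G is a deliberately demanding case for the final comparison, because one observer's correlation with the group differs sharply from the others.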

Unlike  the  computed  Z  scale,  which  is  a  standardized 
scale,  the  least  squares  estimated  scale  is  always  in  terms 
of  the  original  rating  scale  applied;  i.e.,  a  10-point  scale 
will  produce  transformed  scores  that  can  only  be  com- 
pared to  other  scales  based  on  10-point  ratings.  This  may 
be  an  advantage  for  communication  of  the  rating  results; 
for  example,  it  avoids  the  negative  number  aspect  of  the 
Z-score  scale.  At  the  same  time,  care  must  be  exercised 
in  combining  or  comparing  one  least  squares  scale  with 
others,  especially  if  the  other  scales  are  based  on  a  differ- 
ent explicit  rating  scale.  This  comparability  problem  can 
be  overcome,  however,  by  appropriate  transformations 
of  the  final  scale  (as  to  percentiles,  Z-scores,  or  some 
other  "standard"  distribution). 


Scenic  Beauty  Estimate 

Scenic Beauty Estimate (SBE) scaling procedures were
originally  developed  for  use  in  scaling  ratings  of  scenic 
beauty  of  forest  areas  (Daniel  and  Boster  1976),  but  the 
procedures  are  appropriate  for  use  with  ratings  of  other 
types  of  stimuli.  Both  the  "by-observer"  and  "by-slide" 
options  for  deriving  SBEs  proposed  by  Daniel  and  Boster 
(1976)  are  described  here.  The  derivation  of  individual 
scale  values  in  each  option  follows  Thurstone's  "Law 
of  Categorical  Judgment"  (Torgerson  1958),  modified  by 
procedures suggested by the "Theory of Signal
Detectability" (Green and Swets 1966). Scale values are
derived  from  the  overlap  ("confusion")  of  the  rating  dis- 
tributions of  different  stimuli,  where  the  rating  distri- 
butions are  based  on  multiple  ratings  for  each  stimulus. 
The overlap in stimulus rating distributions indicates the
proximity of the stimuli on the underlying psychological
dimension (e.g., perceived beauty). SBEs provide
an  equal-interval  scale  measure  of  perceived  values, 
given  the  underlying  measurement  theory  and  computa- 
tional procedures,  as  described  by  Hull  et  al.  (1984). 

Following  the  general  psychometric  model  introduced 
earlier,  the  rating  assigned  to  a  stimulus  indicates  the 
relationship  between  the  perceived  value  of  the  stimu- 
lus and  the  categories  on  the  observer's  rating  criterion 
scale  being  applied  on  that  occasion.  For  a  stimulus  to 
be  rated  an  "8",  its  perceived  value  must  be  below  the 
upper  boundary  of  the  "8"  category  on  the  criterion 
scale,  but  above  the  upper  boundary  for  a  rating  of  "7" 
(as  illustrated  by  observer  A  for  stimulus  1  in  fig.  2). 
Thurstone's  Law  of  Categorical  Judgment  proposes  that 
the  magnitude  of  the  difference  between  the  perceived 
value  of  a  stimulus  and  the  location  of  the  lower  bound- 
ary of  a  given  rating  category  (e.g.,  for  an  "8")  can  be 
represented  by  the  unit  normal  deviate  corresponding 
to  the  proportion  of  times  that  the  stimulus  is  perceived 
to be above that criterion category boundary.^7

As  Torgerson  (1958)  explains,  the  Law  of  Categorical 
Judgment  relies  on  variation  in  perceived  values.  It  is 
assumed  that  the  perceived  value  of  any  given  stimulus 
varies  from  moment  to  moment  (and  observer  to  ob- 
server) due  to  random  processes,  and  forms  a  normal  dis- 
tribution on  the  underlying  psychological  continuum. 
The  locations  of  the  individual  category  boundaries  also 
vary  from  moment  to  moment  due  to  random  processes, 
acting  much  like  stimuli,  each  forming  a  normal  distri- 
bution on  the  psychological  continuum.  The  momentary 
values  for  a  particular  stimulus  and  for  the  criterion 
category  boundaries  determine  the  rating  that  will  be  as- 
signed to  that  stimulus  in  a  given  instance. 

The  area  under  the  theoretical  normal  distribution  of 
perceived  values  for  a  given  stimulus  can  be  divided  into 
the  portion  corresponding  to  the  number  (proportion) 
of  times  the  stimulus  is  perceived  to  be  higher  on  the 
dimension  of  interest  than  a  given  category  boundary, 
and  the  remaining  portion  corresponding  to  the  num- 
ber (proportion)  of  times  the  stimulus  is  perceived  to  be 
lower  than  the  given  boundary.  These  proportions,  in 
turn,  can  be  translated  to  standard  deviation  units,  or 
unit  (standard)  normal  deviates  (commonly  referred  to 
as  Zs).  The  unit  normal  deviate  corresponding  to  the 
proportion  of  times  a  stimulus  is  rated  at  or  above  a  given 
rating  category  indicates  the  magnitude  of  the  difference 
between  the  perceived  value  of  the  stimulus  and  the  lo- 
cation of  the  lower  boundary  of  that  rating  category  on 
the  underlying  dimension.  In  other  words,  Thurstone's 
judgment  scaling  model  assumes  that  differences  in  dis- 
tances on  the  underlying  psychological  continuum  are 
proportional  to  the  unit  normal  deviates  associated  with 
the  observed  proportions  (based  on  the  ratings  assigned). 
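The translation from a proportion to a unit normal deviate can be sketched with Python's standard library. The proportions below are hypothetical, and the function name `unit_normal_deviate` is introduced only for this illustration.

```python
from statistics import NormalDist

def unit_normal_deviate(proportion):
    """Return the Z for which the standard normal cumulative
    distribution function equals `proportion` (0 < proportion < 1)."""
    return NormalDist().inv_cdf(proportion)

# Hypothetical example: a stimulus is rated at or above "7" in 84 percent
# of the ratings and at or above "8" in 50 percent of them.
z_7 = unit_normal_deviate(0.84)
z_8 = unit_normal_deviate(0.50)
print(round(z_7, 2), round(z_8, 2))
```

Under the model, this stimulus lies about one standard deviation above the lower boundary of the "7" category and exactly at the lower boundary of the "8" category. Proportions of exactly 0 or 1 have no finite deviate, and `inv_cdf` raises an error for them.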

^7 Torgerson (1958) presents the Law of Categorical Judgment in terms
of the proportion of times a stimulus is perceived to be below the upper
boundary of a given rating category. Torgerson's approach and the one
presented here yield perfectly correlated scale values. The approach used
here, which was also used by Daniel and Boster (1976), has the
advantage of assigning higher scale values to the stimuli that were assigned
higher ratings.

In Thurstone's full model (Torgerson 1958), the
difference between the perceived value of a stimulus and the
location of a category boundary is:

$$C_k - S_i = \Phi^{-1}(CP_{ik}) (\sigma_i^2 + \sigma_k^2 - 2 r_{ik} \sigma_i \sigma_k)^{0.5}$$    [9]

where

$C_k$ = location of the lower boundary of the kth
      category on the rating scale (e.g., the perceived
      scenic beauty value sufficient to meet the
      observer's standards for a rating of at least "8")
$S_i$ = scale value (e.g., perceived scenic beauty) of
      stimulus i
$CP_{ik}$ = proportion of times stimulus i is rated above the
          lower boundary of the kth rating category
$\Phi^{-1}$ = inverse normal integral function (which
            translates $CP_{ik}$, the cumulative proportion, to the
            appropriate unit normal deviate, Z)
$\sigma_i$ = dispersion (standard deviation) of the stimulus
           value distribution
$\sigma_k$ = dispersion of the category boundary
           distribution
$r_{ik}$ = correlation between positions of stimulus i and
         category boundary k.

Simplifying assumptions are necessary to apply
Thurstone's model, because $\sigma_i$, $\sigma_k$, and $r_{ik}$ are unknown and
may be unique for any pairing of a stimulus and a
category boundary, causing the standard deviation units
in which each estimate of $C_k - S_i$ is expressed to also
be unique. If we assume that $C_k$ and $S_i$ are normally
distributed and independent for all k and i, so that $r_{ik} = 0$,
and that $\sigma_i$ and $\sigma_k$ are unknown constants for all values
of i and k, so that the variances of stimulus distributions
and response criterion distributions are respectively
homogeneous (Torgerson's "Condition D," 1958), [9]
reduces to:

$$C_k - S_i = \Phi^{-1}(CP_{ik}) \sigma$$    [10]

where $\sigma$ is an unknown constant^8 and $\Phi^{-1}(CP_{ik})$ is
simply the standard normal deviate (Z) corresponding to the
cumulative proportion $CP_{ik}$. As noted by Torgerson
(1958) and Hull et al. (1984), these simplifying
assumptions are generally tenable and greatly reduce
computational complexity. Note that $\sigma$ can be assumed to be 1.0
since an interval scale is, in any case, determined only
to within a linear transformation (both origin and
interval are arbitrary).

The unit normal deviates (Zs) are computed for
differences between $S_i$ and each of the rating category
boundaries (e.g., based on the proportion of times
stimulus i is rated at or above a "7", an "8", etc.). Torgerson
(1958)  shows  that,  given  a  complete  matrix  of  Zs  and  the 
simplifying  assumptions  mentioned  above,  the  mean  of 
the  Zs  averaged  across  the  category  boundaries  is  the  best 
estimate  of  the  scale  value  for  a  stimulus.  This  scale 
value  (mean  Z)  indicates  the  average  distance,  in  stand- 
ard deviation  units,  of  the  perceived  value  of  the  stimu- 
lus from  the  different  rating  category  boundaries. 

^8 $\sigma$ is the interval size of the theoretical scale on which the differences
are measured, $(\sigma_i^2 + \sigma_k^2 - 2 r_{ik} \sigma_i \sigma_k)^{0.5}$, which reduces to
$(\sigma_i^2 + \sigma_k^2)^{0.5}$ given the assumption that $r_{ik} = 0$.


Mean Zs are computed for each stimulus. Given the
necessary assumptions, the mean Zs indicate the
relative positions of the stimuli on the underlying
psychological continuum. So long as the mean rating category
boundaries remain consistent across stimuli being rated,
the difference between the perceived values for any two
stimuli will be unaffected by the relative locations of the
category boundaries.^9 The differences between stimuli
are not affected by observers' rating (criterion) biases;
whether observers choose to apply "strict" criteria
(tending to assign low ratings to all stimuli) or "lax" criteria
(tending to assign high ratings), the scaled differences
between the stimulus values will be the same. Indeed,
the stimulus differences will be the same even though
entirely different rating scales were applied (e.g., an
8-point scale versus a 10-point scale).^10 Moreover, the
scaled differences between stimulus values will be the
same regardless of how the category boundaries are
arranged.^11 This feature of the Thurstone scaling
procedures assures that the measure of stimulus differences
can be interpreted as an equal-interval scale even though
the category boundaries might not be equally spaced.
Thus, if the necessary assumptions are met, ratings
which may only achieve an ordinal level of measurement
provide the basis for an interval scale measure of the
perceived values of the stimuli.
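The independence of stimulus differences from the placement of the category boundaries can be illustrated with model proportions rather than observed ratings. In this sketch the perceived values, the two sets of boundaries, and the function name `model_mean_z` are all hypothetical. The proportion of times a stimulus exceeds a boundary is assumed to follow the simplified model with a common dispersion, so that each Z equals the distance of the stimulus above the boundary in standard deviation units.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def model_mean_z(stimulus_value, boundaries, sigma=1.0):
    """Mean, over category boundaries, of the unit normal deviate of the
    model proportion of times the stimulus exceeds each boundary."""
    zs = []
    for c_k in boundaries:
        cp_ik = STD_NORMAL.cdf((stimulus_value - c_k) / sigma)
        zs.append(STD_NORMAL.inv_cdf(cp_ik))
    return mean(zs)

s1, s2 = 0.4, -0.3                     # hypothetical perceived values
strict = [-0.5, 0.2, 0.9, 1.6]         # boundaries placed high, evenly spaced
lax = [-1.8, -1.5, -0.2, 0.1, 0.5]     # lower, unevenly spaced, one more category

diff_strict = model_mean_z(s1, strict) - model_mean_z(s2, strict)
diff_lax = model_mean_z(s1, lax) - model_mean_z(s2, lax)
print(round(diff_strict, 6), round(diff_lax, 6))
```

Both differences equal the difference between the two perceived values, because the mean boundary location is subtracted from every stimulus alike and therefore cancels.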

In  practice,  Thurstone's  model  relies  on  multiple  rat- 
ings for  a  stimulus  to  provide  the  proportions  cor- 
responding to  the  theoretical  momentary  locations  of  the 
perceived  values  of  the  stimuli  and  category  boundaries. 
Ratings  may  be  provided  either  by  multiple  observers 
who  each  rate  the  same  stimuli  or  by  a  single  observer 
who  rates  each  stimulus  a  number  of  times.  The  normal- 
ity and  constant  variance  assumptions  are  perhaps  most 
easily  met  in  the  case  where  a  single  observer  provides 
all  the  necessary  ratings  (where  replications  of  ratings 
of a given stimulus are provided by the same observer).
In  this  case,  as  long  as  the  observer  is  consistent  with 
respect  to  the  mean  locations  of  the  rating  criterion 
boundaries  and  adheres  to  the  independence  and 
homogeneous  variance  expectations,  the  ratings  are  all 
based  on  the  same  set  of  boundaries.  A  practical  problem 
with  this  case,  however,  is  that  it  places  a  considerable 
burden  on  a  single  observer,  who  may  become  bored  or 
otherwise  affected  by  the  requirement  to  rate  the  same 
stimuli  again  and  again. 

^9 Note that $(C_k - S_i) - (C_k - S_{i+1}) = S_{i+1} - S_i$, and that this holds
across all category boundaries (all $C_k$). That is, if the rating criterion
boundaries are consistent, the $C_k$ "drop out" and the differences
between the scale values of the stimuli indicate differences in perceived
value.

^10 This assumes, of course, that each scale has enough categories to
allow for sufficiently fine distinctions among the perceived values of the
stimuli. A 3-point scale, for example, would not generally allow for
sufficient discrimination among a set of stimuli.

^11 This assumes, of course, that the categories are properly ordered, with
each successive criterion signifying more (for example) of the
underlying property being measured.

Scale values that are specific to individual observers
can also be generated when stimuli are grouped into
"conditions," as in Daniel and Boster's (1976)
"by-observer" SBE. "Conditions" are simply higher order
stimuli  that  can  each  be  represented  to  observers  by  mul- 
tiple stimuli.  Ratings  of  the  stimuli  within  a  condition 
provide  the  necessary  replications  as  long  as  the  condi- 
tions are  each  relatively  homogeneous  and  the  stimuli 
within  a  condition  are  randomly  selected.  As  in  the 
single  observer  case,  this  approach  produces  observer- 
specific  scale  values,  such  that  each  scale  value  only 
relies  on  one  observer's  set  of  rating  criteria.  However, 
because  scale  values  are  only  produced  for  conditions 
(i.e.,  groups  of  stimuli),  and  not  all  objects  of  interest 
are  easily  represented  by  a  sufficient  number  of  random- 
ly selected  stimuli,  this  approach  cannot  always  be 
applied. 

The  most  commonly  applied  case  is  where  each 
observer  only  rates  each  stimulus  once  and  each  stimu- 
lus is  independent  of  all  others.  Here,  the  replications 
necessary  for  creating  the  rating  distributions  must  be 
provided  by  combining  ratings  over  multiple  observers. 
Because  each  stimulus  scale  value  is  based  on  the  rating 
criteria  of  multiple  observers,  it  must  be  assumed  that 
the  constant  mean  and  variance  assumptions  hold  across 
observers.  This  assumption  is  more  likely  to  be  met  if 
observers  are  randomly  sampled  from  some  relevant  ob- 
server population,  but  it  is  a  more  stringent  assumption 
than  that  required  in  the  single-observer  applications. 

By-Stimulus  SBE 

The  "by-stimulus"  option  (Daniel  and  Boster's  (1976) 
"by-slide"  option)  requires  that  multiple  observers  rate 
each  stimulus.  The  by-stimulus  procedure  is  generally 
used  when  only  one  or  a  few  stimuli  are  available  for 
each  condition  of  interest,  or  when  preference  values  are 
needed  for  each  stimulus.  In  this  procedure,  a  mean  Z 
for  each  stimulus  is  computed  based  on  the  distribution 
of  ratings  assigned  by  the  different  observers.  The  cumu- 
lative proportion  of  observers  judging  the  stimulus  to 
be  at  or  above  each  rating  category  is  transformed  to  a 
Z  by  reference  to  the  standard  normal  distribution.  The 
Zs  are  then  averaged  over  the  rating  categories  to  yield 
a  mean  Z  for  each  stimulus.  This  procedure  requires  the 
assumption  that  perceived  values  of  stimuli  and  rating 
criterion  boundaries  are  normally  distributed  over  the 
multiple  observers. 

¹²Readers desiring a more thorough explanation of Thurstone's
categorical judgment scaling model should consult Torgerson (1958), and
examine the chapter on the Law of Comparative Judgment before going
on to the chapter on the Law of Categorical Judgment.

¹³When this baseline is only used to set the origin of the SBE scale
based on ratings obtained in one rating session, the stimuli comprising
the baseline can be selected so that an SBE of zero indicates some
specific condition. For example, in assessing the scenic beauty of forest
areas managed under different harvest methods, this condition has often
been the set of scenes sampled from an area representing the pretreatment
state of the forest. However, when the SBEs of two or more rating
sessions are to be compared, the baseline is also used to "bridge the gap"
between ratings obtained in those different sessions. In this case, the
baseline stimuli might best be selected to be representative of the full
range of stimuli being rated, as described in the Psychological Scaling
section.


Individual  mean  Zs  for  each  stimulus  are  further  ad- 
justed, following  a  procedure  suggested  by  the  Theory 
of  Signal  Detectability,  to  a  common  "rational"  origin. 
A  subset  of  the  stimuli  called  a  "baseline"  is  selected 
to determine the origin of the SBE scale.¹³ The overall
mean  Z  of  the  baseline  stimuli  is  subtracted  from  the 
mean  Z  of  each  stimulus,  and  then  the  difference  is  mul- 
tiplied by  100  (eliminating  the  decimals)  to  yield  in- 
dividual stimulus  SBEs.  As  with  any  interval  scale,  of 
course,  both  the  origin  and  interval  size  are  arbitrary. 

To  summarize,  the  computation  of  the  original  (Daniel 
and  Boster  1976)  SBE  for  a  stimulus  requires  three  steps. 
First,  the  mean  Z  for  each  stimulus  is  computed  as 
follows: 

MZ_i = \frac{1}{m-1} \sum_{k=2}^{m} \Phi^{-1}(CP_{ik})    [11]

where

MZ_i      = mean Z for stimulus i
\Phi^{-1} = inverse normal integral function
CP_{ik}   = proportion of observers giving stimulus i a
            rating of k or more
m         = number of rating categories.

In  step  2,  the  mean  of  the  mean  Zs  of  the  stimuli  com- 
posing the  baseline  condition  is  computed.  In  the  last 
step,  the  mean  Z  of  each  stimulus  is  adjusted  by  sub- 
tracting the  mean  Z  of  the  baseline,  and  the  mean  Z 
differences  are  multiplied  by  100  to  remove  decimals: 

SBE_i = (MZ_i - BMMZ) \times 100    [12]

where

SBE_i = SBE of stimulus i
MZ_i  = mean Z of stimulus i
BMMZ  = mean of mean Zs of the baseline stimuli.
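
Steps 2 and 3 can be sketched in a few lines of Python. This is an illustration only, not the RMRATE code; the function name `sbe_from_mean_z` and the sample mean Zs are hypothetical.

```python
from statistics import mean

def sbe_from_mean_z(mean_zs, baseline_ids):
    """Apply [12]: subtract the baseline mean of mean Zs, then multiply by 100.

    mean_zs      -- dict mapping a stimulus id to its mean Z (MZ_i)
    baseline_ids -- ids of the stimuli that make up the baseline
    """
    bmmz = mean(mean_zs[i] for i in baseline_ids)   # BMMZ
    return {i: (mz - bmmz) * 100 for i, mz in mean_zs.items()}

# Hypothetical mean Zs for four stimuli; stimuli 1 and 2 form the baseline.
example = {1: -0.20, 2: 0.10, 3: 0.45, 4: -0.60}
sbes = sbe_from_mean_z(example, baseline_ids=[1, 2])
```

Because the origin is the baseline mean, the SBEs of the baseline stimuli always average to zero.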

Two conventions are used to facilitate computation of
MZ_i in [11]. First, because all ratings of any stimulus
must be at or above the lowest (e.g., the "1") category,
CP_{ik} for the bottom rating category is always 1.0, and
the bottom category is therefore omitted in the computation
of MZ_i. Second, Thurstone's model only strictly applies
where the distribution of ratings for a stimulus extends
over the full range of the rating scale (i.e., where
each stimulus is placed at least once in each criterion
category). Where this does not occur (where CP_{ik} = 1.0
for rating categories at the low end of the scale or
CP_{ik} = 0 for categories at the high end), \Phi^{-1}(CP_{ik})
is undefined. For these cases, we have adopted the convention
proposed by Bock and Jones (1968) and adopted by
Daniel and Boster (1976): for CP_{ik} = 1.0 and CP_{ik} = 0,
substitute CP_{ik} = 1 - 1/(2n) and CP_{ik} = 1/(2n),
respectively, where n is the number of observations (ratings)
for each stimulus.¹⁴
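
The mean Z of [11], with both conventions applied, can be sketched as follows. The function name `mean_z` is hypothetical, and the 10-point scale used in the example is an assumption inferred from the ratings shown in table 4.

```python
from statistics import NormalDist

def mean_z(ratings, m):
    """Mean Z of [11] for one stimulus.

    ratings -- the n ratings given to the stimulus (integers from 1 to m)
    m       -- number of rating categories
    """
    n = len(ratings)
    inv_phi = NormalDist().inv_cdf            # inverse normal integral function
    zs = []
    for k in range(2, m + 1):                 # first convention: skip bottom category
        count = sum(r >= k for r in ratings)  # observers rating the stimulus k or more
        cp = count / n                        # cumulative proportion CP_ik
        # Second convention: replace 1.0 and 0 so the inverse is defined.
        if count == n:
            cp = 1 - 1 / (2 * n)
        elif count == 0:
            cp = 1 / (2 * n)
        zs.append(inv_phi(cp))
    return sum(zs) / (m - 1)

# Stimulus 1 of the first observer group in table 4: ratings 1, 3, and 6.
mz_example = mean_z([1, 3, 6], m=10)
```

With n = 3 observers, a cumulative proportion of 1.0 is replaced by 0.8333 and a cumulative proportion of 0 by 0.1667 before the transformation.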

¹⁴For example, if n = 30, a cumulative proportion of 1.0 is set to
0.9833, and a cumulative proportion of 0 is set to 0.0167. The Zs of these
cumulative proportions, like those of the other cumulative proportions,
could then be obtained from a normal probability table. For example, the
Z for a cumulative probability of 0.0167 is -2.13. Note that in many
presentations of the normal probability table, only the upper half of the
normal curve areas is tabulated. In such cases, the cumulative probability
must be appropriately adjusted before using the table to determine the Z.




A  variation  on  the  original  SBE  procedure  is  to  also 
adjust  the  interval  size  of  the  SBE  scale  by  dividing  the 
original  SBE  by  the  standard  deviation  of  the  mean  Zs 
of  the  baseline  stimuli.  For  this  variation,  the  SBE  of 
equation  [12]  is  adjusted  to  the  interval  size  of  the  base- 
line to  effect  a  standardization  of  the  mean  Zs: 

SBE*_i = SBE_i / BSDMZ    [13]

where

SBE*_i = standardized SBE of stimulus i
BSDMZ  = standard deviation of mean Zs of baseline
         stimuli.
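
A minimal sketch of [12] and [13] together is given below. The function name and the example mean Zs are hypothetical. The sample standard deviation (n - 1 denominator) is an assumption; it is the choice that reproduces the 63 and 126 pattern of table 4 for equally spaced mean Zs.

```python
from statistics import mean, stdev

def standardized_sbe(mean_zs, baseline_ids):
    """Return SBE* of [13] for every stimulus in mean_zs."""
    baseline = [mean_zs[i] for i in baseline_ids]
    bmmz = mean(baseline)        # origin of the scale
    bsdmz = stdev(baseline)      # interval size (sample SD, an assumption)
    sbe = {i: (mz - bmmz) * 100 for i, mz in mean_zs.items()}   # [12]
    return {i: s / bsdmz for i, s in sbe.items()}               # [13]

# Hypothetical, equally spaced mean Zs; the baseline is the full set.
mean_zs = {1: -0.4, 2: -0.2, 3: 0.0, 4: 0.2, 5: 0.4}
sbe_star = standardized_sbe(mean_zs, baseline_ids=[1, 2, 3, 4, 5])
```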

The  combination  of  the  origin  and  interval  size  adjust- 
ments effectively  standardizes  the  SBEs  to  the  baseline. 
This  standardization  is  particularly  useful  where  SBEs 
of  different  observer  groups  who  have  rated  different 
nonbaseline  stimuli  are  to  be  combined  or  otherwise 
compared.  Although  the  computation  of  mean  Zs, 
described  above,  theoretically  creates  an  equal-interval 
scale,  it  does  not  assure  that  the  scales  of  different  groups 
of  observers  will  have  the  same  origin  or  interval  size. 
The original SBE was designed to adjust for the possi-
bility that different observer groups may differ in the ori-
gin of their scores. The full standardization of the mean
Zs  based  on  the  ratings  of  the  baseline  stimuli  is 
designed  to  adjust  for  the  possibility  that  different  ob- 
server groups  may  differ  in  the  origin  and/or  interval  size 
of  their  scores. 

Table  4  depicts  by-stimulus  SBEs  and  associated  rat- 
ings for  four  observer  groups.  The  baseline  of  each  set 
is the full set of stimuli. The ratings of all three observers
within each of groups A, B, and E are perfectly correlated,
although, as seen by examining the mean
ratings,  the  interval  sizes  for  each  group  are  different. 
Examining  the  SBEs  for  these  three  data  sets,  we  see  that 
the  origins  of  all  three  are  identical  (stimulus  3  of  each 
set  has  an  SBE  of  0),  but  the  interval  sizes  differ.  Mov- 
ing to  the  SBE*s,  we  see  that  the  interval  sizes  among 
the  three  sets  are  now  also  identical,  as  would  be  ex- 
pected following  a  standardization  of  the  mean  Zs  to  the 
baseline  where  the  ratings  of  all  observers  of  the  three 
sets  correlate  perfectly.  Thus,  agreement  (in  absolute 
terms)  between  two  observer  groups'  scale  values  is  im- 
proved by  adjusting  for  both  origin  and  interval  size 
differences.  Of  course,  neither  adjustment  affects  the 
linear  association  between  the  sets  of  scale  values.  It  can 
also  be  seen  by  comparing  observer  groups  B  and  D  of 
table 4 that equal mean ratings between data sets do
not  necessarily  lead  to  equal  SBEs  or  SBE*s  if  the  two 
sets  of  ratings  are  not  perfectly  correlated. 
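
The identical SBE*s of groups A, B, and E follow from a general property of standardization. As a sketch, suppose the mean Zs of two groups are linearly related, MZ'_i = a + b MZ_i with b > 0, and both groups use the same baseline stimuli (a and b are illustrative symbols, not notation from the text):

```latex
\begin{aligned}
BMMZ' &= a + b\,BMMZ, \qquad BSDMZ' = b\,BSDMZ, \\
SBE^{*\prime}_i &= \frac{100\,(MZ'_i - BMMZ')}{BSDMZ'}
                 = \frac{100\,b\,(MZ_i - BMMZ)}{b\,BSDMZ}
                 = SBE^{*}_i .
\end{aligned}
```

The origin adjustment of [12] alone removes a but leaves b, which is why the SBEs of the three groups share an origin but not an interval size.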

Table 4.—Ratings and by-stimulus SBEs for four observer groups.

                        Rating by observer                 Scale value
Observer                ------------------     Mean     -----------------
group      Stimulus      1      2      3      rating     SBE^a    SBE*^a

A             1          1      3      6       3.33       -43     -126
              2          2      4      7       4.33       -22      -63
              3          3      5      8       5.33         0        0
              4          4      6      9       6.33        22       63
              5          5      7     10       7.33        43      126

B             1          1      2      1       1.33       -86     -126
              2          3      4      3       3.33       -43      -63
              3          5      6      5       5.33         0        0
              4          7      8      7       7.33        43       63
              5          9     10      9       9.33        86      126

D             1          1      2      1       1.33       -87     -125
              2          2      6      2       3.33       -47      -68
              3          3      7      6       5.33         3        4
              4          5      8      9       7.33        46       66
              5          9      9     10       9.33        85      123

E             1          1      6      1       2.67       -62     -126
              2          2      7      3       4.00       -31      -63
              3          3      8      5       5.33         0        0
              4          4      9      7       6.67        31       63
              5          5     10      9       8.00        62      126

^a Baseline is the entire set of (i.e., all 5) stimuli.

By-Observer SBE

The by-observer option requires that each observer
provide multiple ratings of each condition (e.g., forest
area) that is to be scaled. This may be accomplished by
having each observer rate the same stimulus a number
of times on different occasions. Usually, however, this
is accomplished by having an observer rate a number
of different stimuli representing each condition (e.g.,
different scenes from within the same forest area). The
distribution of an individual observer's ratings of the
multiple stimuli for a condition is then used to derive
a  scale  value  for  that  condition.  Individual  observers' 
rating  distributions  are  "normalized"  by  transforming 
the  proportion  of  stimuli  assigned  to  each  rating  category 
to  the  appropriate  unit  normal  deviate,  or  Z.  This  pro- 
cedure requires  the  assumption  that  ratings  by  an  ob- 
server of  the  stimuli  within  a  condition  are  sampled  from 
an  underlying  normal  distribution.  Zs  for  each  rating 
category  are  then  averaged  to  yield  a  mean  Z  for  each 
individual  observer  for  each  condition.  These  compu- 
tations are  summarized  as  follows: 

MZ_{jc} = \frac{1}{m-1} \sum_{k=2}^{m} \Phi^{-1}(CP_{jck})    [14]

where

MZ_{jc}   = mean Z of observer j for condition c
\Phi^{-1} = inverse normal integral function
CP_{jck}  = proportion of stimuli of condition c given a
            rating of k or more by observer j
m         = number of rating categories.

The  two  conventions  listed  for  [11]  also  apply  to  [14]. 
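
As a sketch of [14], the same computation can be applied to one observer's ratings of the stimuli within a condition. The function name and the example ratings are hypothetical; clamping the count is simply a compact way of applying the 1/(2n) convention.

```python
from statistics import NormalDist

def observer_condition_mean_z(ratings, m):
    """Mean Z of [14]: one observer's ratings of the stimuli in one condition."""
    n = len(ratings)                       # stimuli rated in this condition
    inv_phi = NormalDist().inv_cdf
    total = 0.0
    for k in range(2, m + 1):              # bottom category omitted
        count = sum(r >= k for r in ratings)
        # Clamp so that CP = 1.0 becomes 1 - 1/(2n) and CP = 0 becomes 1/(2n).
        count = min(max(count, 0.5), n - 0.5)
        total += inv_phi(count / n)
    return total / (m - 1)

# Hypothetical ratings by one observer of five scenes from each of two conditions.
low_condition = observer_condition_mean_z([2, 3, 3, 4, 4], m=10)
high_condition = observer_condition_mean_z([6, 7, 7, 8, 8], m=10)
```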

Individual  observer  mean  Zs  for  each  condition  are 
then  adjusted  to  the  origin  of  a  common  baseline.  Each 
observer's  overall  mean  Z  for  the  baseline  condition(s) 
is  subtracted  from  the  mean  Z  for  each  of  the  conditions 
being  assessed.  The  baseline  condition  is  thus  assigned 
a  value  of  zero.  The  origin-adjusted  mean  Zs  are  then 
multiplied  by  100  to  yield  individual  observer  SBEs  for 
each  condition: 

SBE_{jc} = (MZ_{jc} - BMZ_j) \times 100    [15]

where

SBE_{jc} = SBE of observer j for condition c
MZ_{jc}  = mean Z of observer j for condition c
BMZ_j    = mean Z of observer j for the baseline.

Individual  observer  SBEs,  adjusted  to  the  same  base- 
line, may  then  be  averaged  to  derive  an  aggregate  or 
group  SBE  value  for  each  condition: 

SBE_c = \frac{1}{n} \sum_{j=1}^{n} SBE_{jc}    [16]

where

SBE_c = SBE for condition c
n     = number of observers.
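
Equations [15] and [16] can be sketched as below. A single baseline condition is assumed for simplicity, and the observer labels, condition names, and mean Zs are hypothetical.

```python
from statistics import mean

def by_observer_sbe(mean_z, baseline):
    """mean_z[j][c] holds MZ_jc; baseline names the condition used as origin."""
    per_observer = {
        j: {c: (mz - by_cond[baseline]) * 100 for c, mz in by_cond.items()}
        for j, by_cond in mean_z.items()
    }                                                         # [15]
    conditions = next(iter(mean_z.values())).keys()
    group = {
        c: mean(per_observer[j][c] for j in per_observer)     # [16]
        for c in conditions
    }
    return per_observer, group

mean_z = {
    "observer 1": {"pretreatment": -0.10, "thinned": 0.25},
    "observer 2": {"pretreatment": 0.30, "thinned": 0.45},
}
per_observer, group = by_observer_sbe(mean_z, baseline="pretreatment")
```

Each observer keeps an individual SBE for each condition, and the group value is their simple average.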

Note  that  the  by-observer  SBE  described  here  is  the 
same  as  the  one  presented  by  Daniel  and  Boster  (1976), 
who  provide  a  detailed  example  of  the  computation  of 
by-observer  SBEs.  We  do  not  introduce  a  variation  to 
their  procedure  similar  to  the  standardization  variation 
presented  above  for  the  by-stimulus  SBE.  The  by- 
observer  computations  do  not  offer  a  similar  opportu- 
nity for  standardization  unless  scores  are  combined 
across  observers,  and  to  combine  across  observers  would 
eliminate  a  key  feature  of  the  by-observer  procedure, 
which  is  individual  interval  scale  scores  for  each 
observer. 


Comparison  of  By-Stimulus  and  By-Observer  SBEs 

The  principal  difference  between  the  two  SBE  pro- 
cedures is  in  whether  the  final  SBE  index  is  derived  from 
the  distribution  of  ratings  of  multiple  stimuli  by  a  single 
observer,  or  from  the  distribution  of  ratings  by  multiple 
observers  for  a  single  stimulus.  The  by-observer  pro- 
cedure uses  the  distribution  of  ratings  of  multiple  stimuli 
within  a  condition  by  one  observer  to  derive  that  ob- 
server's SBE  for  that  condition.  In  so  doing,  it  is  not  pos- 
sible to  obtain  an  SBE  measure  for  each  stimulus;  the 
variation  among  stimuli  is  used  to  derive  the  condition 
SBE.  The  by-stimulus  procedure  uses  the  distribution 
of  ratings  by  multiple  observers  for  a  single  stimulus  to 
derive  an  SBE  for  that  stimulus.  By  this  procedure  it  is 
not  possible  to  obtain  an  SBE  measure  for  each  observer; 
the  variation  among  observers  is  used  to  derive  the  SBE 
for  a  stimulus.  A  condition  SBE  can  be  computed  from 
stimulus  SBEs  by  averaging  over  stimuli,  however,  if 
there  is  an  adequate  sample  of  stimuli  to  represent  the 
condition. 

The  choice  between  the  two  SBE  procedures  typical- 
ly is  determined  by  the  design  of  the  assessment  experi- 
ment. If  a  relatively  small  number  of  conditions,  each 
represented  by  a  number  of  different  stimuli,  are  to  be 
assessed,  the  by-observer  procedure  may  be  used.  Usual- 
ly at  least  15  stimuli,  randomly  sampled  from  each  con- 
dition, are  required  to  make  the  normal  distribution  of 
stimulus  ratings  assumption  tenable.  If  there  are  many 
conditions  each  represented  by  only  one  or  a  few  stimuli, 
the  by-stimulus  procedure  typically  must  be  used.  Usual- 
ly at  least  15  randomly  assigned  observers  are  required 
to  meet  the  normal  distribution  of  observer  ratings  as- 
sumption. When  data  having  multiple  observers  and 
multiple  stimuli  for  each  condition  have  been  analyzed 
by  both  the  by-observer  and  the  by-stimulus  procedures, 
the  resulting  condition  SBEs  have  typically  been  found 
to  be  essentially  identical.  In  practice,  situations  allow- 
ing the  by-observer  procedure  (i.e.,  where  at  least  15  ran- 
domly sampled  stimuli  are  available  to  represent  each 
condition  assessed)  have  been  relatively  infrequent.  But, 
in  such  situations,  as  long  as  at  least  15  observers  are 
used,  the  by-stimulus  procedure  can  usually  be  applied 
with  mathematically  equivalent  results. 
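
The rules of thumb above can be summarized in a small helper. This is only a sketch: the function name is hypothetical, and the threshold of 15 is the guideline quoted in the text, not a hard limit.

```python
def feasible_sbe_procedures(n_observers, stimuli_per_condition, minimum=15):
    """List the SBE procedures whose rule-of-thumb sample size is met."""
    options = []
    if stimuli_per_condition >= minimum:   # enough stimuli per condition
        options.append("by-observer")
    if n_observers >= minimum:             # enough observers per stimulus
        options.append("by-stimulus")
    return options

# Thirty observers, three scenes per condition: only by-stimulus qualifies.
typical_design = feasible_sbe_procedures(n_observers=30, stimuli_per_condition=3)
```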

Comparison  of  SBEs  and  Mean  Ratings 

The  by-stimulus  SBE  is  distinguished  from  the  mean 
rating  of  [1]  by  the  transformation  to  standard  normal 
deviates.  This  is  shown  by  recognizing  the  relationship 
between  the  mean  rating  and  the  sum  of  the  proportions 
of  ratings  in  each  rating  category: 

MR_i = \sum_{k=1}^{m} k P_{ik}    [17]

where

MR_i   = mean rating of stimulus i
P_{ik} = proportion of observers giving stimulus i a
         rating of k
m      = number of rating categories.
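
As a quick check of this relationship, the mean rating can be computed from the category proportions and compared with the ordinary arithmetic mean. The function name is hypothetical; the ratings are those of stimulus 3 in the first observer group of table 4.

```python
from collections import Counter

def mean_rating_from_proportions(ratings):
    """Weight each rating value k by the proportion of observers who chose it."""
    n = len(ratings)
    proportions = {k: count / n for k, count in Counter(ratings).items()}  # P_ik
    return sum(k * p for k, p in proportions.items())

ratings = [3, 5, 8]
mr = mean_rating_from_proportions(ratings)
arithmetic_mean = sum(ratings) / len(ratings)
```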
Thus, the important difference between the mean rating
of [1] (MR_i) and the mean Z of [11] (MZ_i) is that in the
mean rating the proportion (P_{ik}) is weighted by the
rating value (k), while in the mean Z the cumulative
proportion (CP_{ik}) is weighted by the inverse normal
integral function (\Phi^{-1}). Other differences between the
mean rating and the SBE, the standardization to the baseline
and multiplication by 100 in [12], merely cause a
linear transformation.
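
A standard identity for ratings on integer categories 1 to m, which is not stated in the original text, makes the parallel between the two measures explicit. The mean rating can itself be written as a sum of the cumulative proportions:

```latex
MR_i = \sum_{k=1}^{m} k\,P_{ik} = \sum_{k=1}^{m} CP_{ik},
\qquad
MZ_i = \frac{1}{m-1} \sum_{k=2}^{m} \Phi^{-1}(CP_{ik}).
```

Read this way, the mean rating adds the cumulative proportions as they stand, whereas the mean Z passes each one through the inverse normal integral function before averaging.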

To  compare  mean  ratings  of  stimuli  judged  during  a 
given  session,  one  must  assume  that  on  average  the 
group's  rating  criterion  scale  is  equal  interval,  plus  of 
course  that  the  rating  criterion  scale  is  consistent  for  the 
duration  of  the  rating  session.  To  compare  mean  ratings 
of  two  different  observer  groups,  we  must  also  assume 
that  the  rating  criterion  scales  of  the  two  groups  are  iden- 
tical. But  to  use  SBEs  to  compare  stimuli  within  a  group 
or  to  compare  across  groups,  we  need  to  assume  (in  ad- 
dition to  the  normality  and  independence  assumptions) 
only  that  raters,  on  average,  were  each  consistent  in  use 
of  their  individual  rating  criterion  scales  for  the  dura- 
tion of  the  rating  session. 


Comparison  of  SBEs  With  Z-Scores  and  LSRs 

SBEs  may  be  distinguished  from  Z-scores  in  several 
ways.  First,  individual  Z-scores  are  directly  computed 
from  the  ratings  assigned  to  each  stimulus  by  each  ob- 
server. In  the  by-observer  SBE  procedure,  the  Zs  are 
derived  from  the  distribution  of  ratings  by  one  observer 
over  the  multiple  stimuli  within  a  condition.  The  propor- 
tions (actually  cumulative  proportions)  of  the  stimuli 
within  a  condition  that  are  assigned  to  each  rating 
category  are  transformed  to  Zs  using  the  inverse  normal 
integral  function,  assuming  that  those  ratings  are  sam- 
pled from  a  normal  distribution. 

In  the  by-stimulus  SBE  procedure,  the  Zs  are  derived 
from  the  distribution  of  multiple  observers'  ratings  of 
an  individual  stimulus.  The  proportion  of  observers  as- 
signing a  given  rating  category  to  the  stimulus  is  trans- 
formed to  a  Z,  assuming  that  the  set  of  observer  ratings 
was  sampled  from  a  normal  distribution  within  the  rele- 
vant population  of  observers.  Because  these  Zs  depend 
upon  the  distribution  of  different  observers'  ratings  for 
one  stimulus,  they  cannot  be  directly  compared  with  the 
Z-scores  computed  for  a  single  observer  over  multiple 
stimuli.  Of  course,  if  the  ratings  of  all  observers  of  a  data 
set  are  perfectly  correlated,  the  baseline-adjusted  mean 
Z-scores will be identical to the by-stimulus SBE*s, ex-
cept for the decimal point, which is two places to the
right  in  the  SBE*.  And,  if  the  baseline  is  the  full  set  of 
stimuli,  the  mean  Z-scores  will  be  identical  to  the  SBE*s, 
except  for  the  decimal  point,  as  can  be  seen  by  compar- 
ing tables  2  and  4  for  observer  groups  A,  B,  and  E.  Fur- 
thermore, under  the  condition  of  perfectly  correlated 
ratings,  mean  Z-scores  differ  from  mean  LSRs  only  by 
their  origin  (grand  mean  rating)  and  interval  size  (stand- 
ard deviation  of  mean  ratings),  and  mean  LSRs  are  iden- 
tical to  mean  ratings.  That  is,  if  ratings  of  all  observers 
within a group are perfectly correlated, and if the baseline
is the entire set of stimuli,

SBE*_i / 100 = MZ_i = (LSR_i - MMR)/SDMR = (MR_i - MMR)/SDMR    [18]

where

SBE*_i = standardized SBE of stimulus i
MZ_i   = mean Z-score of stimulus i
LSR_i  = mean least squares rating of stimulus i
MMR    = mean of the mean ratings assigned to all
         stimuli by all observers in the group (grand
         mean rating)
SDMR   = standard deviation of the mean ratings assigned
         by all observers in the group
MR_i   = mean rating assigned to stimulus i.
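
The equality in [18] can be checked numerically for the first observer group of table 4, whose ratings are perfectly correlated. The sketch below assumes a 10-point scale and sample standard deviations throughout, and it omits the LSR term, which is defined earlier in the paper. All function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev

ratings = {                      # ratings[observer] lists stimuli 1 to 5
    1: [1, 2, 3, 4, 5],
    2: [3, 4, 5, 6, 7],
    3: [6, 7, 8, 9, 10],
}
M = 10                           # assumed number of rating categories
N_STIMULI = 5

def standardize(values):
    mu, sd = mean(values), stdev(values)      # sample SD, an assumption
    return [(v - mu) / sd for v in values]

def thurstone_mean_z(stimulus_ratings, m):
    """Mean Z of [11] with the 1/(2n) convention applied by clamping."""
    n = len(stimulus_ratings)
    inv_phi = NormalDist().inv_cdf
    total = 0.0
    for k in range(2, m + 1):
        count = sum(r >= k for r in stimulus_ratings)
        total += inv_phi(min(max(count, 0.5), n - 0.5) / n)
    return total / (m - 1)

by_stimulus = [[ratings[j][i] for j in ratings] for i in range(N_STIMULI)]

# SBE*/100 with the full stimulus set as baseline.
sbe_star_over_100 = standardize([thurstone_mean_z(r, M) for r in by_stimulus])

# Mean Z-score: standardize each observer's ratings, then average by stimulus.
z_by_observer = [standardize(r) for r in ratings.values()]
mean_z_scores = [mean(z[i] for z in z_by_observer) for i in range(N_STIMULI)]

# Standardized mean ratings, (MR_i - MMR)/SDMR.
standardized_mean_ratings = standardize([mean(r) for r in by_stimulus])
```

All three lists agree to rounding error, at roughly -1.26, -0.63, 0, 0.63, and 1.26.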

Of  course,  the  ratings  of  all  observers  are  rarely 
perfectly  correlated,  so  the  relationship  between  SBE*s, 
Z-scores,  LSRs,  and  ratings  will  be  more  complex,  as  can 
be  seen  by  comparing  tables  2,  3,  and  4  for  observer 
group  D.  Theoretically,  the  SBE  metrics  would  be 
preferred  because  they  do  not  require  the  assumption 
that  observers'  ratings  constitute  an  equal-interval  scale. 
Indeed, as Torgerson (1958), Green and Swets (1966),
and  others  have  shown,  SBE-type  metrics  computed  for 
reasonable-sized  groups  of  observers  will  be  quite  robust 
to  substantial  violations  of  the  formal  distribution  as- 
sumptions. 

Summary 

The  information  presented  above  about  the  various 
procedures  available  in  RMRATE  for  scaling  rating  data 
is  summarized  here  in  two  ways.  First,  we  review  which 
procedures  address  the  potential  problems  with  inter- 
preting rating  data.  Second,  we  discuss  when  to  use  each 
of  the  procedures. 

Scaling  Procedures  and  the  Interpretation  of  Ratings 

In  the  "Psychological  Scaling"  section,  several  poten- 
tial problems  with  interpreting  rating  data  were 
described,  which,  to  the  extent  they  exist  for  a  given  set 
of  ratings,  limit  inferences  that  can  be  drawn  about  the 
perceptions  of  the  stimuli  being  rated.  Two  of  those 
problems,  lack  of  intraobserver  consistency  and  percep- 
tual or  criterion  shifts,  can  only  be  addressed  by  proper 
experimental  design,  which  is  outside  the  scope  of  this 
paper.  The  other  potential  problems  can  all  be  reduced 
or  avoided  by  employing  a  proper  scaling  procedure. 
Those  problems  are  listed  in  table  5. 

An  X  in  table  5  indicates  that  the  respective  scaling 
procedure  somehow  addresses  the  potential  problem. 
Median  and  mean  ratings  do  not  address  any  of  the  iden- 
tified problems.  The  OAR  adjusts  for  differences  in 
criterion  scale  origin,  but  not  interval  size  differences. 
The  Z-score  procedures  adjust  for  both  origin  and  inter- 
val differences  in  criterion  scales,  assuming  that  each 
observer  is  using  an  equal-interval  criterion  scale.  Thus, 
if  it  is  important  to  adjust  for  linear  differences  between 
Table 5.—Which scaling procedures address potential problems of rating data?


Scaling 
procedure 


Potential  problems 


Unequal-     Linear  differences 
interval     between  observers' 
scale         criterion  scales 


Origin 
size 


Interval 


Linear  differences 
between  groups' 
criterion  scales 


Origin 
size 


Interval 


Lack  of 
interobserver 
correspondence 
(aside  from 

linear 
differences) 


Median  rating 

Raw  ratings 

OAR 

X 

BOAR 

X 

X 

Z-score 

X 

X 

BZ-score 

X 

X 

X 

LSR 

X 

X 

BLSR 

X 

X 

By-stimulus  SBE 

X 

X 

By-stimulus  SBE* 

X 

X 

By-observer  SBE 

X 

X 

X 

X 

observers  or  observer  groups,  the  Z-score  procedures 
would  be  preferred  over  raw  ratings  and  OARs. 

The  LSR  procedures  also  adjust  for  linear  differences 
between  observers  within  a  group,  and  in  addition 
weight  each  observer's  ratings  by  how  well  the  observer 
agrees  with  the  group.  However,  this  scaling  method 
does  not  adjust  for  linear  differences  between  groups. 
If  weighting  based  on  fit  with  the  group  is  desired,  and 
ratings  of  separate  groups  are  not  going  to  be  compared 
on  an  absolute  basis,  the  least  squares  rating  would  be 
preferred  over  the  Z-score  procedures. 

Only  the  SBE  procedures  adjust  for  unequal  interval 
judgment  criterion  scales.  This  advantage  is  obtained  at 
the  expense  of  combining  ratings  over  observers  or 
stimuli,  so  that  individual  scale  values  (for  each  stimu- 
lus by  each  observer)  are  not  obtained.  All  three  SBE 
procedures  adjust  for  origin  differences  among  observer 
groups,  but  only  the  by-stimulus  SBE*  adjusts  for  inter- 
val size  differences  among  groups. 

Which  Procedure  To  Use  When 

Choice  of  the  most  appropriate  psychological  scaling 
procedure  for  any  given  application  will  depend  upon 
the  design  of  the  scaling  experiment,  the  goals  of  the 
measurement  task,  and  the  extent  to  which  the  investi- 
gator is  willing  to  accept  the  assumptions  of  each  scal- 
ing procedure.  If  the  resulting  scale  values  are  to  be  used 
for  only  ordinal  comparisons,  no  assumptions  are  neces- 
sary about  the  nature  of  the  rating  scale.  In  this  case,  the 
median  is  probably  the  appropriate  scaling  procedure, 
since  the  others  would  entail  needless  complexity  for 
the  task  at  hand.  If  the  scale  values  are  to  be  used  as  in- 
terval measures  (which  is  required  for  most  standard 
statistical operations), choosing among the mean rating,
computed  Z-score,  LSR,  and  SBE  procedures  will  de- 
pend primarily  upon  the  assumptions  the  investigator 
is  willing  to  make  about  the  data,  and  upon  the  desired 
features  for  the  final  scale.  The  mean  rating  and  LSR 
procedures  produce  scale  values  in  terms  of  the  origi- 


nal rating  scale,  while  the  Z-score  and  SBE  procedures 
produce  scale  values  that  are  not  easily  interpreted  in 
terms  of  the  original  scale.  There  is  no  absolute  mean- 
ing to  the  rating  values,  so  maintaining  the  scale  values 
in  terms  of  the  original  rating  scale  is  only  cosmetic. 
Nevertheless,  it  may  be  easier  to  explain  results  to  some 
audiences  in  terms  of  rating  points. 

The  mean  ratings,  Z-scores,  and  LSRs  assume  that 
each  group  of  observers  used  an  equal-interval  scale  for 
rating  the  stimuli.  The  SBE  procedure  does  not  require 
that  observers  or  groups  use  equal-interval  scales;  it 
assumes  only  that  rating  criteria  are  consistent  over  a 
rating session, and that (for the by-observer SBE) ratings by
an observer of the stimuli within a condition are normally
distributed, or (for the by-stimulus SBE) the
ratings  of  each  stimulus  by  all  observers  are  normally 
distributed. 

If  the  assumption  that  ratings  of  a  stimulus  over  mul- 
tiple observers  are  normally  distributed  is  valid,  or  at 
least  more  tenable  than  the  assumption  that  each  ob- 
server's ratings  represent  an  interval  scale,  then  the  by- 
stimulus  SBE  procedure  is  a  good  choice.  The  SBE 
procedure  also  provides  a  standard  scale,  irrespective 
of  the  number  of  categories  in  the  original  rating  scale, 
that  has  been  shown  in  theory  and  practice  to  be  com- 
parable to scales derived by other psychophysical pro-
cedures (e.g., paired-comparisons and rankings). A
possible  disadvantage  of  the  by-stimulus  SBE  procedure 
is  that  scale  values  are  not  provided  for  individual 
observers. 

The  Z-score  procedure  is  widely  used  for  transform- 
ing distributions  to  a  standard  form  and  is  computation- 
ally straightforward.  A  possible  disadvantage  is  that 
each individual observer's ratings are transformed separate-
ly, without regard to how other observers in the group
rated  the  same  stimuli.  Assuming  a  linear  relationship 
among  observers'  ratings,  the  least  squares  procedure 
"fits"  each  observer's  ratings  to  the  mean  ratings  as- 
signed by  the  entire  group  of  observers,  thus  providing 
individual  scale  values  for  each  observer  that  depend  on 
the  relationship  with  the  group  ratings.  The  final  scale, 
however, is dependent on the number of categories in
the  original  rating  scale  and  thus  cannot  be  directly  com- 
pared (or  combined)  with  scales  derived  from  other 
rating  scales,  or  other  psychological  scaling  procedures. 
Also,  the  least  squares  procedure  incorporates  a  differen- 
tial weighting  of  observers,  which  reduces  the  natural 
variation  in  the  ratings,  in  essence  placing  more 
credence  on  some  observers  than  others,  and  may  be  con- 
trary to  the  goals  of  the  assessment. 
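
One plausible reading of this least squares "fit" is an ordinary regression of the group's mean ratings on an individual observer's ratings, with the fitted values taken as that observer's least squares ratings. The sketch below follows that reading; it is an assumption for illustration and may differ in detail from the LSR definition given earlier in the paper. The function name is hypothetical.

```python
from statistics import mean

def least_squares_ratings(observer_ratings, group_mean_ratings):
    """Fitted values from regressing group mean ratings on one observer's ratings."""
    x, y = observer_ratings, group_mean_ratings
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx                # shrinks toward 0 as agreement with the group falls
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi for xi in x]

# Ratings of the last observer group in table 4 (three observers, five stimuli).
observers = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [1, 3, 5, 7, 9]]
group_means = [mean(obs[i] for obs in observers) for i in range(5)]
lsr_observer_3 = least_squares_ratings(observers[2], group_means)
```

Because this observer's ratings are perfectly linearly related to the group means, the fitted values reproduce the mean ratings exactly; a less consistent observer would be pulled toward the grand mean.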

Unlike  the  SBE  procedures,  the  Z-score  and  least 
squares  procedures  each  provide  individual  scores  for 
each  observer  for  each  stimulus,  a  feature  that  has  some 
important practical advantages. Individual observers'
scales  can  be  inspected  for  internal  consistency  as  well 
as  for  consistency  with  other  observers  in  the  same  as- 
sessment. Further,  the  Z-score  and  LSR  procedures,  like 
the  raw  ratings,  preserve  degrees  of  freedom  for  subse- 
quent analyses,  such  as  analysis  of  variance  to  compare 
stimuli  or  conditions,  or  correlation  and  regression 
analyses  involving  other  measures  available  for  the 
stimuli.  Having  individual  observer  values  for  each 
stimulus  also  facilitates  the  computation  of  conventional 
measures  of  the  error  of  estimate  for  individual  stimuli 
(such  as  the  standard  error  of  the  mean)  based  on  the 
variability  in  scores  among  observers.  Of  course,  this 
advantage  is  gained  at  the  expense  of  assuming  the 
individual  ratings  represent  an  interval  scale  of 
measurement. 
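
For instance, the standard error of the mean for one stimulus can be computed directly from the individual observers' scores. The Z-scores below are hypothetical and the function name is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_sem(scores):
    """Mean and standard error of the mean of one stimulus's scores across observers."""
    return mean(scores), stdev(scores) / sqrt(len(scores))

# Hypothetical Z-scores assigned to one stimulus by five observers.
scores = [0.40, 0.10, 0.65, 0.30, 0.55]
stimulus_mean, stimulus_sem = mean_and_sem(scores)
```

No comparable calculation is available from a by-stimulus SBE, because the observers' ratings are pooled before the scale value is formed.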

If  different  observers  rate  different  subsets  of  the  stimu- 
li and  rate  one  subset  in  common,  then  one  of  the  base- 
line procedures  will  be  most  appropriate.  The  resulting 
scale  will  have  an  origin  (for  the  baseline-adjusted  OAR 
and  the  SBE)  or  an  origin  and  interval  size  (for  the 
baseline-adjusted Z-score and the SBE*) determined by
the  ratings  of  the  baseline  stimuli. 
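
A sketch of the baseline idea for individual ratings is given below, assuming that the baseline-adjusted OAR subtracts the observer's mean baseline rating and that the baseline-adjusted Z-score also divides by the standard deviation of the observer's baseline ratings. These definitions are a reading of the text, and the stimulus labels and ratings are hypothetical.

```python
from statistics import mean, stdev

def baseline_adjust(ratings, baseline_ids):
    """ratings maps a stimulus id to one observer's rating of it."""
    base = [ratings[i] for i in baseline_ids]
    origin, spread = mean(base), stdev(base)    # sample SD, an assumption
    boar = {i: r - origin for i, r in ratings.items()}              # origin only
    bz = {i: (r - origin) / spread for i, r in ratings.items()}     # origin and interval
    return boar, bz

baseline = ["b1", "b2", "b3"]
# Two observers share the baseline but rate different additional stimuli.
observer_1 = {"b1": 2, "b2": 4, "b3": 6, "x": 8}
observer_2 = {"b1": 3, "b2": 6, "b3": 9, "y": 12}
boar_1, bz_1 = baseline_adjust(observer_1, baseline)
boar_2, bz_2 = baseline_adjust(observer_2, baseline)
```

Stimuli x and y receive different baseline-adjusted OARs (4 and 6) but the same baseline-adjusted Z-score (2.0), because only the latter removes the difference in interval size between the two observers.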

Except  perhaps  for  the  median,  all  of  these  scales 
generally  produce  sets  of  scale  values  for  a  set  of  stimu- 
li that  correlate  greater  than  0.90  with  each  other  when 
individual  scale  values  are  averaged  or  otherwise  com- 
bined over  at  least  15  observers  to  produce  a  group  in- 
dex (see  Schroeder  1984).  However,  when  different 
observers  have  used  explicitly  different  rating  scales,  or 
when  individual  differences  between  observers  or  differ- 
ences in  the  contexts  in  which  stimuli  have  been  rated 
are  substantial  (e.g.,  Brown  and  Daniel  1987),  some 
transformation  of  the  original  scale  is  required. 

There  are  also  theoretical  reasons  for  choosing  a  trans- 
formed scale.  The  goal  of  the  different  scaling  pro- 
cedures is  to  provide  estimates  of  the  locations  and 
distances  between  objects  on  the  inferred  psychological 
dimension.  RMRATE  (Brown  et  al.  1990)  provides  the 
investigator  a  choice  among,  and  the  opportunity  to 
compare,  several  psychological  scaling  procedures  that 
approach  this  goal  somewhat  differently. 
APPENDIX


RELATIONSHIPS  AMONG  SCALE  VALUES 

Here  we  review  relationships  among  the  scaling  pro- 
cedures by  comparing  the  results  of  using  each  pro- 
cedure to  scale  ratings  from  several  hypothetical 
observer  groups.  Table  Al  lists  scale  values  of  seven 
scaling  procedures  for  five  hypothetical  observer  groups, 
each  assumed  to  have  rated  the  same  five  stimuli.  The 
baseline  for  the  SBE  is  the  entire  set  of  five  stimuli  for 
each  data  set. 

Observers  of  group  A  differ  only  in  the  origin  of  their 
ratings.  Thus,  the  origin-adjusted  ratings  (OARs)  of  the 
three  observers  in  the  group  are  identical.  Likewise,  the 
Z-scores  of  the  three  observers  are  identical,  as  are  their 
least  squares  ratings  (LSRs).  Furthermore,  notice  that  the 
mean  ratings  and  mean  LSRs  are  identical. 

Group B, like group A, contains observers who differ only
in the origin of their ratings. However, groups A and B
differ in the interval size of their respective ratings, with
observers of group B using a larger rating difference than
observers of group A to draw distinctions among the same
stimuli. For example, observers of group A use a rating
difference of 1 to distinguish between stimulus 1 and
stimulus 2, while observers of group B use a rating
difference of 2 to make this distinction. Thus, the mean OARs
of these two data sets differ, as do the mean LSRs, but the
mean Z-scores of the two sets are identical.
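
The contrast between OARs and Z-scores can be checked with a short derivation. It is a sketch under two assumptions: that a Z-score is a rating minus the observer's mean rating, divided by the observer's standard deviation, and that one observer's ratings y_j are a positive linear function of another's ratings x_j. The symbols a, b, x_j, y_j, and s are introduced here for illustration only.

```latex
y_j = a + b\,x_j \quad (b > 0)
\;\Longrightarrow\;
\bar{y} = a + b\,\bar{x}, \qquad s_y = b\,s_x ,
\\[4pt]
\text{OAR:}\quad y_j - \bar{y} = b\,(x_j - \bar{x}),
\qquad
\text{Z-score:}\quad \frac{y_j - \bar{y}}{s_y}
  = \frac{b\,(x_j - \bar{x})}{b\,s_x}
  = \frac{x_j - \bar{x}}{s_x}.
```

The origin shift a cancels in both quantities, but the interval factor b cancels only in the Z-score. This is consistent with the mean OARs of groups A and B differing while their mean Z-scores agree.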

Group E contains observers whose ratings differ from each
other in both origin and interval size. Thus, the OARs are
different between observers who differ in interval size
(observer 3 versus observers 1 and 2). However, because the
ratings of the three observers are perfectly linearly
related, the Z-scores of all three observers are identical,
the LSRs of the three observers are identical, and the mean
ratings, mean OARs, mean Z-scores, mean LSRs, and SBEs are
perfectly linearly related. Furthermore, because the ratings
of observers of group E are perfectly linearly related to
those of observers of groups A and B, the Z-scores of all
observers of these three data sets are identical, as are the
mean Z-scores.
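
The group E pattern can be reproduced with a few lines of Python. This is a minimal sketch, not RMRATE code: it assumes that an OAR is a rating minus the observer's own mean rating and that a Z-score divides that difference by the observer's sample (n - 1) standard deviation, a choice that appears to match the table A1 entries. The function names are hypothetical.

```python
from statistics import mean, stdev

# Ratings of the three group E observers for stimuli 1-5 (table A1).
ratings = {
    1: [1, 2, 3, 4, 5],
    2: [6, 7, 8, 9, 10],
    3: [1, 3, 5, 7, 9],
}

def origin_adjusted(xs):
    """Subtract the observer's own mean rating from each rating."""
    m = mean(xs)
    return [x - m for x in xs]

def z_scores(xs):
    """Standardize by the observer's own mean and sample (n - 1) SD (assumed)."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

oars = {obs: origin_adjusted(xs) for obs, xs in ratings.items()}
zs = {obs: z_scores(xs) for obs, xs in ratings.items()}

for obs in ratings:
    print(obs, [round(v, 1) for v in oars[obs]], [round(z, 2) for z in zs[obs]])
```

Under these assumptions, observer 3's OARs are twice those of observers 1 and 2, while all three observers receive the same Z-scores.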

The SBE*s of observer groups A, B, and E are identical,
again because the ratings of the observers of each group are
perfectly correlated. Furthermore, because of this, the SBE*s
are identical to the mean Z-scores (except for the placement
of the decimal point). Group F differs from group B in that
observer 3's ratings in group F are monotonically related to
but not perfectly correlated with those of observers 1 and 2.
This difference has the following effects. First, the OARs,
Z-scores, and LSRs of observer 3 are not identical to those
of observers 1 and 2. Second, mean Z-scores and SBE*s of
group F differ from those of group B. Third, the mean
Z-scores and SBE*s of group F differ (by more than the
decimal point shift) and, in fact, are no longer perfectly
correlated.
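
The effect of observer 3's imperfect correlation in group F can be illustrated numerically. The sketch below makes an assumption that this appendix does not state: that an observer's LSRs are the fitted values from a simple regression of the group mean ratings on that observer's own ratings. Under that assumption the code reproduces the group F LSR entries of table A1 to two decimals, but the definition is inferred from the table, not taken from the text. The function names are hypothetical.

```python
from statistics import mean

# Group F ratings for stimuli 1-5, as listed in table A1.
ratings = {
    1: [1, 3, 5, 7, 9],
    2: [2, 4, 6, 8, 10],
    3: [1, 2, 3, 5, 9],
}

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two rating vectors."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def least_squares_ratings(xs, group_means):
    """Fitted values from regressing group mean ratings on one
    observer's ratings (ASSUMED definition of the LSR)."""
    mx, my = mean(xs), mean(group_means)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, group_means))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return [my + slope * (x - mx) for x in xs]

# Mean rating of each stimulus across the three observers.
group_means = [mean(col) for col in zip(*ratings.values())]

r_12 = pearson_r(ratings[1], ratings[2])
r_13 = pearson_r(ratings[1], ratings[3])
lsr = {obs: least_squares_ratings(xs, group_means) for obs, xs in ratings.items()}
```

Here `r_12` is 1 and `r_13` is 0.95, so observers 1 and 2 share the same LSRs while observer 3's LSRs differ from theirs.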

The ratings of observers of group D are monotonic but not
perfectly correlated. Thus, the OARs of the three observers
differ, as do the Z-scores and LSRs of the three observers.
Furthermore, the mean ratings, mean Z-scores, mean LSRs, and
SBEs are not perfectly linearly related (although the SBEs
are perfectly linearly related to the SBE*s). Note, however,
that the mean ratings and mean OARs of group D are identical
to those of group B. Again, identical mean ratings do not
necessarily produce identical Z-scores, LSRs, or SBEs.

Comparisons  for  Data  Sets 
With  a  Common  Baseline 

Table A2 contains ratings for five hypothetical observer
groups. The groups are assumed to have each been randomly
selected from the same observer population and to have each
rated sets of eight stimuli, each set containing three common
baseline stimuli (indicated by a "B") and five unique
stimuli. The nonbaseline ratings of observer groups II, III,
IV, and V are identical, but the baseline ratings of the four
groups differ. The nonbaseline ratings of group I differ from
those of the other groups, but the baseline ratings of groups
I and II are identical. Assuming that the baseline stimuli of
the five data sets are identical, but the nonbaseline stimuli
of the sets are unique, baseline adjustments would facilitate
comparison across the sets.

Table A1. Comparison of scale values for five observer groups.

                Rating           OAR                Z-score               LSR                          Scale value
              (observer)      (observer)          (observer)           (observer)     Median   Mean   Mean   Mean   Mean   SBE  SBE*
Group  Stim    1   2   3      1     2     3       1      2      3      1     2     3  rating rating    OAR      Z    LSR   (a)   (a)

A         1    1   3   6   -2.0   ...  -2.0   -1.26  -1.26    ...   3.33  3.33  3.33       3   3.33  -2.00  -1.26   3.33   -43   ...
          2    2   4   7   -1.0  -1.0  -1.0   -0.63  -0.63  -0.63   4.33  4.33  4.33       4   4.33  -1.00  -0.63   4.33   -22   -63
          3    3   5   8    0.0   0.0   0.0    0.00   0.00   0.00   5.33  5.33  5.33       5   5.33   0.00    ...   5.33     0     0
          4    4   6   9    1.0   ...   ...    0.63   0.63   0.63    ...   ...  6.33       6   6.33    ...    ...    ...    22   ...
          5  ...

B       1-5  ...

D         1    1   2   1  ...
        2-5  ...

E         1    1   6   1   -2.0  -2.0  -4.0   -1.26  -1.26  -1.26   2.67  2.67  2.67       1   2.67  -2.67  -1.26   2.67   -62  -126
          2    2   7   3   -1.0  -1.0  -2.0     ...  -0.63  -0.63   4.00  4.00  4.00       3   4.00  -1.33  -0.63   4.00   -31   -63
          3    3   8   5    0.0   0.0   0.0    0.00   0.00   0.00   5.33  5.33  5.33       5   5.33   0.00   0.00   5.33     0     0
          4    4   9   7    1.0   1.0   2.0    0.63   0.63   0.63   6.67  6.67  6.67       7   6.67   1.33   0.63   6.67    31    63
          5    5  10   9    2.0   2.0   4.0    1.26   1.26   1.26   8.00  8.00  8.00       9   8.00   2.67   1.26   8.00    62   126

F         1    1   2   1   -4.0  -4.0  -3.0   -1.26  -1.26  -0.95   1.07  1.07  2.10       1   1.33  -3.67  -1.16   1.41   -80  -119
          2    3   4   2   -2.0  -2.0  -2.0   -0.63  -0.63  -0.63   3.03  3.03  3.07       3   3.00  -2.00  -0.63   3.04   -43   -64
          3    5   6   3    0.0   0.0  -1.0    0.00   0.00  -0.32   5.00  5.00  4.03       5   4.67  -0.33  -0.11   4.68    -6    -9
          4    7   8   5    2.0   2.0   1.0    0.63   0.63   0.32   6.97  6.97  5.97       7   6.67   1.67   0.53   6.63    37    55
          5    9  10   9    4.0   4.0   5.0    1.26   1.26   1.58   8.93  8.93  9.83       9   9.33   4.33   1.37   9.23    92   137

(a) Baseline is the entire set of (i.e., all 5) stimuli.

The baseline ratings of observer groups I and II of table A2
are identical, but the nonbaseline ratings are not. However,
the nonbaseline ratings of the two groups are perfectly
correlated, differing only in interval size. Assuming that
the nonbaseline stimuli in each set did not affect the
ratings of the baseline stimuli (i.e., assuming that there is
no interaction between the ratings of the baseline and
nonbaseline stimuli), the identity of the baseline ratings of
the two data sets suggests (but of course does not prove)
that the observers of the two groups perceive the stimuli
equally and use identical judgment criteria. Thus, assuming
equal-interval scales, and given the psychometric model, the
mean ratings of the two groups could reasonably be assumed to
be directly comparable. The baseline-adjusted metrics (BOAR,
BZ-score, BLSR, and SBE*) would then also be assumed to be
directly comparable. For example, using SBE*s, stimulus 1
(rated by group I) would be considered as different from
stimulus 3 as stimulus 7 (rated by group II) is from stimulus
8, since both differences are indicated by an SBE* difference
of 200. However, the procedures that generate scale values
from a combination of the baseline and nonbaseline ratings
(the mean OAR, mean Z-score, and mean LSR) would produce
scale values that are not directly comparable across data
sets; the basis of comparison must be only the set of common
(i.e., baseline) stimuli.

Although the nonbaseline ratings of observer groups II and
III are identical, the baseline ratings of the two sets
differ in terms of origin (baseline ratings of group III have
a higher origin than those of group II). This simple
difference in ratings of the baseline stimuli suggests (given
the psychometric model) that the two observer groups used
different rating criterion scales, and that the identity of
the ratings of the nonbaseline stimuli is fortuitous. The
mean ratings of the two sets are identical, because the mean
rating computation does not use the baseline ratings. Given
the baseline ratings of the two sets, we would be in error to
assume that the mean ratings of the two sets are directly
comparable (e.g., to conclude that stimulus 7 is identical,
or nearly so, to stimulus 12 on the underlying dimension).

The baseline-adjusted mean OARs of observer groups II and
III, however, can more reasonably be compared (again,
assuming equal-interval ratings and the psychometric model)
because the baseline OAR procedure adjusts for origin
differences among sets that have a common baseline, and, as
we have seen, the two sets differ only in origin of the
rating scale. A similar logic applies to the SBE (except that
the assumption of equal-interval ratings is not needed). In
addition, the mean baseline-adjusted Z-scores and SBE*s are
comparable across the two groups, because these procedures
also adjust for origin differences in the baseline ratings.
All these metrics (mean BOAR, mean BZ-score, SBE, and SBE*)
indicate, for example, that stimulus 6 is considered
equidistant between stimuli 11 and 12 on the underlying
dimension. But the mean OARs, or the mean Z-scores, of the
two sets are not comparable, because the individual OAR or
Z-score transformations are based on the ratings of all the
stimuli, including the nonbaseline stimuli. Also, note that
the SBE*s of each of the groups are identical to the mean
baseline-adjusted Z-scores of the groups, except for the
decimal point. This occurs because the ratings of all
observers within each of the groups are perfectly correlated.

Table A2. Comparison of scale values for five observer groups
that rated sets of baseline and unique stimuli.

Now examine the ratings of groups II and IV. As in the
previous comparison (of groups II and III), the ratings of
the nonbaseline stimuli of groups II and IV are identical.
However, the baseline ratings of these two groups differ in
interval size, such that a difference of 1 in group II's
baseline ratings appears to be equivalent to a difference of
2 in group IV's baseline ratings. The mean ratings of groups
II and IV are identical, but, as before, the baseline ratings
suggest a difference in criterion scales and that the
identity in nonbaseline ratings is misleading. The
baseline-adjusted OARs of the two sets are also identical, as
are the SBEs of the two sets, but these scale values are not
comparable between sets, because the OAR and SBE procedures
do not adjust for interval size differences between sets of
baseline ratings. Only the baseline-adjusted Z-scores and
SBE*s of the two sets could (given the psychometric model)
reasonably be assumed to be comparable, because these two
procedures adjust for interval size differences between sets
of baseline ratings.
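
The difference between the two baseline adjustments can be sketched in Python. The ratings below are hypothetical and are not the table A2 values; they only mimic the situation described for groups II and IV (identical nonbaseline ratings, baseline ratings differing in interval size). The sketch assumes that the baseline OAR subtracts the mean of the baseline ratings and that the baseline Z-score also divides by the sample standard deviation of the baseline ratings. The function names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical ratings (NOT the table A2 values): three baseline stimuli
# and five nonbaseline stimuli for one observer from each of two groups.
baseline = {"II": [3, 4, 5], "IV": [2, 4, 6]}
nonbaseline = {"II": [1, 3, 5, 7, 9], "IV": [1, 3, 5, 7, 9]}

def baseline_oar(xs, base):
    """Shift ratings by the mean of the baseline ratings (origin only)."""
    m = mean(base)
    return [x - m for x in xs]

def baseline_z(xs, base):
    """Standardize ratings by the baseline mean and sample SD
    (origin and interval size)."""
    m, s = mean(base), stdev(base)
    return [(x - m) / s for x in xs]

boar = {g: baseline_oar(nonbaseline[g], baseline[g]) for g in baseline}
bz = {g: baseline_z(nonbaseline[g], baseline[g]) for g in baseline}

print(boar)  # identical for both groups: interval size is not adjusted
print(bz)    # group IV values are half those of group II
```

With these numbers the baseline-adjusted OARs of the two groups coincide, while the baseline-adjusted Z-scores of group IV are half those of group II, reflecting the larger interval size of its baseline ratings.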

Next, consider the ratings of observer groups I and IV.
Ratings of all observers of group I, including those of the
baseline, are perfectly correlated with those of observers of
group IV, but (as seen by examining the ratings for the
baseline stimuli) the ratings of the two data sets differ in
interval size. The mean ratings of the baseline stimuli
suggest that a rating difference of 1 was used by group I to
indicate the same difference between stimuli as a rating
difference of 2 by group IV. The mean ratings of the stimuli
of the two groups differ, as do the mean OARs. The mean
Z-scores of the two groups are identical, because all ratings
of the two sets are perfectly correlated. Likewise, the
baseline-adjusted mean Z-scores of the two sets are
identical. However, the mean Z-scores of each group are not
identical to the baseline-adjusted mean Z-scores, because the
mean and standard deviation for the standardization are
computed from all the ratings for the mean Z-score and just
from the baseline ratings for the baseline-adjusted mean
Z-score. The mean LSRs of the two data sets differ, as do the
mean baseline-adjusted LSRs, because the least squares
procedures do not adjust for linear differences between data
sets. Finally, the SBEs of the two observer groups are
different, but the SBE*s of the two groups are identical, and
are equal to the baseline-adjusted mean Z-scores, except for
the decimal point, because the two sets of ratings are
perfectly correlated and the latter two procedures adjust for
interval size differences across sets of baseline stimuli.

The nonbaseline ratings of observer group V are identical
to those of groups II, III, and IV, but the baseline ratings
of group V are neither identical to nor perfectly correlated
with those of the other groups. Because the Z-score, least
squares, and SBE procedures all utilize the baseline ratings
in the computation of their respective scale values, for each
procedure the scale values of group V are not perfectly
correlated with those of the other groups. For example, the
correlations of the SBE*s of group V to those of groups II,
III, and IV are less than 1.0. Furthermore, the scale values
produced by the Z-score, least squares, and SBE scalings of
group V's ratings are not perfectly correlated with each
other, or with the mean ratings. Each procedure deals with
the lack of correlation in a different way. Only the SBEs and
SBE*s can be perfectly correlated, because one is a simple
linear transformation of the other.